Visual Linguistics

Fundamentals of Visualizing Language Data

Noah Bubenhofer (University of Zurich)

Darmstadt, 19 February 2020

A Diagrammatic View on Corpus Linguistics

armchair linguist

corpus linguist

Butterfly Collection

Photo: Chip Clark, Smithsonian Institution

Data-Driven Corpus Linguistics

Ramelli, Agostino: „Le diverse et artificiose machine del capitano Agostino Ramelli : nellequali si contengono varij et industriosi movimenti, degni digrandissima speculatione, per cavarne beneficio infinito in ogni sorte d’operatione : composte in lingua Italiana et Francese“ (1588).

Vincent Placcius, 1689: Illustration featuring a system for filing notes, from De arte excerpendi vom gelahrten Buchhalten liber singularis: quo genera & praecepta excerpendi, ab aliis hucusq[ue]; tradita omnia, novis accessionibus aucta, ordinata methodo exhibentur, et suis quaeque materiis applicantur..., 1689, by Vincent Placcius (1642-1699). *GC6.P6904.689d, Houghton Library, Harvard University

Roberto Busa at the control console of the IBM 705, IBM World Headquarters, 590 Madison Avenue, New York, 1958. [IBM Archives]

Busa (1951) – Sancti Thomae Aquinatis Hymnorum ritualium varia specimina concordantiarum: primo saggio di indici di parole automaticamente composti e stampati da macchine IBM a schede perforate = A first example of word index automatically compiled and printed by IBM punched card machines

Busa (1951: 66) – Sancti Thomae Aquinatis Hymnorum ritualium varia specimina concordantiarum: primo saggio di indici di parole automaticamente composti e stampati da macchine IBM a schede perforate = A first example of word index automatically compiled and printed by IBM punched card machines

The Mother of All Demos

December 9, 1968, Dr. Douglas C. Engelbart and the Augmentation Research Center (ARC) at Stanford Research Institute

  • …as an intellectual worker
  • …were supplied with a computer display
  • …a computer that was alive for you all day…
  • …and was instantly responsive…
  • …to show you rather than tell you about this program

  • …statements…
  • …do some operations on it…
  • constructing views
  • …organize, categorize, subcategorize (the list)…
  • …makes it very nice for studying
  • …to see the route I have to go… that's my plan for getting home tonight…
  • …just point on it to see
  • …so we have this feature to structure our material…
  • …make different views… modifying the structure…
  • …I have entities of all sorts… (character, words, paragraphs, lines…)

Computers and digital data allow data manipulation: splitting, sorting, decontextualizing, recontextualizing, linking... → diagrammatic operations

From Language Use to an Index

  • Data basis (corpus) used as starting point to study language use.
  • Important: Computational processing of digital data.
  • But: all of this rests on diagrammatic operations, i.e. the smart arranging and re-arranging of linguistic data!
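
The step from running text to an index can be sketched in a few lines. This is a minimal illustration only, assuming a naive regular-expression tokenizer and two invented sample sentences; the function name `build_index` and the text ids are hypothetical and not taken from the talk.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lower-cased word form to its loci: (text_id, token_position)."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        # Naive tokenization (assumption): every run of word characters is a token.
        tokens = re.findall(r"\w+", text.lower())
        for position, token in enumerate(tokens):
            index[token].append((text_id, position))
    # Alphabetical ordering turns the loose collection of loci into a register.
    return dict(sorted(index.items()))


# Invented sample texts, for illustration only.
texts = {
    "t1": "The corpus is the starting point.",
    "t2": "The index points back to the corpus.",
}
index = build_index(texts)

for word, loci in index.items():
    print(word, loci)
```

Each entry points back to the source texts, so the index is a reordering of the data rather than a replacement for it.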

Diagrams (following Peirce, Stjernfelt, Krämer etc. [Krämer 2016] – "Diagrams as thinking tools")

Diagrammatic Operations in Corpus Linguistics

Corpus Linguistics

  • Destroying the unity of texts and partitioning them into loci (matches)
  • Ordering loci:
    • Index / register as references to source texts
    • Systematic configuration and ordering using diagrammatic means (concordances, collocation profiles, distributions…)
  • "Reading": search for emergence

➔ Even a simple diagram such as a concordance view opens new perspectives on text.
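
A concordance (keyword in context, KWIC) view makes these operations concrete: the text is split into tokens, every match becomes a locus with its context, and the loci are re-ordered. The sketch below is a toy version under stated assumptions: a regex tokenizer, a fixed context window, sorting by right context, and an invented sample text. The name `kwic` is hypothetical.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, node, right context) for every match of keyword."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    # Re-ordering the loci by right context destroys the text sequence
    # and lets recurring patterns line up underneath each other.
    return sorted(lines, key=lambda line: line[2].lower())


# Invented sample text, for illustration only.
sample = "The corpus is large. A corpus can be small. Every corpus has limits."
result = kwic(sample, "corpus")

for left, node, right in result:
    print(f"{left:>20} | {node} | {right}")
```

Sorting by the left context instead, or widening the window, gives a different view of the same loci.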

Plato's Μένων (Meno)

Socrates talking to Meno's slave: What must be done to double the area of the square?
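
The solution reached in the dialogue is to build the new square on the diagonal of the original one. As a short worked check, with the side length a and the diagonal d as symbols introduced here for illustration:

```latex
\[
d = \sqrt{a^2 + a^2} = a\sqrt{2}
\quad\Longrightarrow\quad
d^2 = 2a^2,
\qquad\text{whereas}\qquad
(2a)^2 = 4a^2 .
\]
```

Doubling the side quadruples the area; only the square on the diagonal doubles it, which is easier to see in the drawn diagram than in the formula.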

Diagrams

diagrams to present
diagrams to explore → Visual Analytics
diagrams showing theoretical concepts

Peirce 1992, Stjernfelt 2007, Krämer 2009, Bauer/Ernst 2010, Putzo 2014, Chen et al. 2008, Keim et al. 2010

Diagrams in Linguistics

Diagrammatic Types in Linguistics

list, map, partitura (sheet music, scores), vector, graph

Lists

Maps

Der Zürcher Sommer 1968. Edited by Joachim Scharloth and Angelika Linke, with the collaboration of Noah Bubenhofer, Susanne Haaf, Céline Jourdain, Monika Schnoz, Ursula Stutz, Peter Zaugg and Angela Zimmermann. Zürich 2008.

Maps

Georg Wenker: Deutscher Sprachatlas – snippet of the map "Kleider (clothes)" (1887-1923)

Source: Bubenhofer: Geokollokationen – diskursive Gestaltung von Welt (statistically significant collocates of toponyms)

Graphs

Savigny, Christofle de (1530?-1608?): Tableaux accomplis de tous les arts libéraux, contenans... par singulière méthode de doctrine une générale et sommaire partition des dicts arts amassez et réduicts en ordre pour le soulagement et profit de la jeunesse, par M. Christofle de Savigny,... 1587, http://gallica.bnf.fr/ark:/12148/bpt6k122948d (accessed 2 June 2016).

Graphs

Schleicher, August: Die deutsche Sprache, Cotta 1860: 28.

Partitura

Apel, Willi: The notation of polyphonic music, 900-1600, Cambridge, Mass., Mediaeval Academy of America 1961: 5

Partitura

Apel, Willi: The notation of polyphonic music, 900-1600, Cambridge, Mass., Mediaeval Academy of America 1961: 205

Partitura

Partitura?


Annotations in the Text+Berg corpus (Bubenhofer et al. 2015)

Visualizations and diagrammatic operations reshape data, opening new perspectives and enabling new insights.

Destroying Unity of Text (recontextualization)

Destroying Sequence

Enriching with Dimensions, Linking Data

Example 1: Visualization of Narration

Linguistic Patterns: Example Birth Reports

Bubenhofer, Noah (2018): Serialität der Singularität: Korpusanalyse narrativer Muster in Geburtsberichten. In: Zeitschrift für Literaturwissenschaft und Linguistik, pp. 1–32.

Alternative Visualizations

Distributional Semantics / Word Embeddings

14,000 Birth Reports by Mothers: Word Embeddings

Wehe: anrollen (0.7714179754257202), Hammerwehe (0.7110541462898254), wehe. (0.6944172978401184), Wehe. (0.6870291829109192), Wehenpause (0.6566441059112549), Nächste (0.6404309868812561), Welle (0.6262998580932617), Preßwehe (0.6261609792709351), Ausatmen (0.617641270160675), weglegen (0.6134337186813354)

Schmerz: Wehenschmerz (0.8821748495101929), schmerzen (0.746356725692749), Schmerzen. (0.7165833115577698), Dauerschmerz (0.6974124908447266), Bewegungsdrang (0.6873125433921814), wehenschmerz (0.6761729717254639), schmerzen. (0.6695544123649597), Geburtsschmerz (0.6639039516448975), Wehenspitzen (0.6634974479675293), Rückenwehen (0.6496166586875916)

14,000 Birth Reports by Mothers: Word Embeddings

Kind: Baby (0.8563555479049683), Tochter (0.7126445770263672), Sohn (0.6942390203475952), Kleine (0.6648626327514648), Zwilling (0.6449035406112671), Kindlein (0.6330938935279846), Zwerg (0.6289287805557251), Maus (0.6128379106521606), Butzi (0.6042927503585815), Babies (0.60008704662323)

weil: deswegen (0.7363709211349487), sodass (0.6999390125274658), da (0.6969295740127563), außerdem (0.6948705911636353), deshalb (0.6797378063201904), zumal (0.6710671782493591), obwohl (0.6692008376121521), nämlich (0.6597287654876709), aber (0.6589906811714172), daher (0.6349456310272217)

extrem: immens (0.7714534997940063), enorm (0.7656272053718567), ziemlich (0.7187795639038086), wahnsinnig (0.707932710647583), derb (0.7021247744560242), sehr (0.6977962255477905), Rippenschmerzen (0.696819543838501), paaren (0.694392204284668), mörderisch (0.6901798248291016), massiv (0.6896220445632935)
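
Neighbour lists of this kind are typically produced by ranking all other words by the similarity of their vectors to the query word. The sketch below assumes cosine similarity as the measure (the slides do not state which one was used) and works on tiny invented three-dimensional vectors, not on the embeddings trained on the birth reports. The names `cosine`, `nearest_neighbours` and `toy_vectors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to the query word."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy vectors (assumption); real embeddings have hundreds of dimensions.
toy_vectors = {
    "Kind": [0.9, 0.1, 0.0],
    "Baby": [0.8, 0.2, 0.1],
    "Tochter": [0.7, 0.4, 0.0],
    "weil": [0.0, 0.1, 0.9],
}

for neighbour, score in nearest_neighbours("Kind", toy_vectors):
    print(f"{neighbour} ({score:.4f})")
```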

14,000 Birth Reports by Mothers: Clusters of Word Embeddings

Cluster ID 244:

Aerztin, Arzt, Assistenzarzt, Assistenzärztin, Chefarzt, Chefärztin, Doc, OA, Oberarzt, Oberärztin, OÄ, Ärtzin, Ärztin

Label (with distance to centroid): Oberarzt (0.9604488611221313), Ärztin (0.9517141580581665), Oberärztin (0.9398374557495117)

14,000 Birth Reports by Mothers: Clusters of Word Embeddings

Cluster ID 248:

Durchatmen, Dösen, Einschlafen, Hyperventilieren, Ruhepause, Sekundenschlaf, hindurch, immerwieder, manchmal, rauben, sogar, totmüde, wegdösen, weggedämmert, weggedöst, weggenickt, zeitweise, zwischen

Label (with distance to centroid): wegdösen (0.9144155979156494), weggedämmert (0.9106881022453308), Sekundenschlaf (0.8776345252990723)
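
One way to arrive at such cluster labels is to average the vectors of all cluster members into a centroid and pick the members closest to it. The sketch below assumes that the scores are cosine similarities to the centroid (an assumption: values close to 1 suggest similarity rather than distance) and uses invented two-dimensional vectors. `centroid`, `cluster_labels` and `toy_cluster` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cluster_labels(members, top_n=3):
    """Return the top_n member words most similar to the cluster centroid."""
    center = centroid(list(members.values()))
    scored = [(word, cosine(vec, center)) for word, vec in members.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy vectors for a few words of one cluster (assumption).
toy_cluster = {
    "Arzt": [1.0, 0.0],
    "Ärztin": [0.8, 0.2],
    "Oberarzt": [0.9, 0.1],
    "Doc": [0.5, 0.5],
}

for word, score in cluster_labels(toy_cluster):
    print(f"{word} ({score:.4f})")
```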

Distribution of Clusters in the Course of the Stories
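
A distribution of this kind can be approximated by normalizing each token's position to the length of its story and counting cluster occurrences per segment. The sketch below is an assumption about the general technique, not the procedure used for the plots: stories are given as token lists, each story is cut into `n_bins` equal segments, and the cluster names and sample stories are invented.

```python
from collections import Counter


def cluster_distribution(stories, word_to_cluster, n_bins=5):
    """Count cluster occurrences per relative position bin across all stories."""
    counts = [Counter() for _ in range(n_bins)]
    for tokens in stories:
        for position, token in enumerate(tokens):
            cluster = word_to_cluster.get(token)
            if cluster is None:
                continue  # token does not belong to any cluster of interest
            # Relative position in the story mapped to a bin 0 .. n_bins - 1.
            bin_index = position * n_bins // len(tokens)
            counts[bin_index][cluster] += 1
    return counts


# Invented toy data (assumption): two tokenized stories and hypothetical cluster names.
stories = [
    ["Wehe", "Arzt", "Schmerz", "Kind"],
    ["Arzt", "Wehe", "Wehe", "Kind", "Kind"],
]
word_to_cluster = {"Wehe": "labour", "Schmerz": "labour", "Arzt": "staff", "Kind": "child"}

dist = cluster_distribution(stories, word_to_cluster, n_bins=2)
for i, counter in enumerate(dist):
    print(f"segment {i}: {dict(counter)}")
```

Because positions are relative, stories of different lengths become comparable, and clusters that typically occur early or late in a narrative show up as peaks in the corresponding segments.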

Interaction Design and Art

Nadine Prigann, interaction designer, ZHdK Zürich: Explorative Spatial Analysis (2018)

Example 2: A (Slightly More) Discourse-Oriented View on Conversations

Power of Diagrams to Give New Perspectives on Data

Growth Rings (Dendrochronology)

By Arnoldius - own work (self-made photo), CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=568944

Growth Rings

Comparison of Conversations

FOLK_E_00080:
Informal dialogue: Pause in the theatre

FOLK_E_00012:
Informal dialogue: Interaction with children (playing a game)

FOLK_E_00120:
Institutional communication: school class

Visualizations and Thought Styles

Diagrams in Science and the Humanities

Diagrams and Cultures of Knowledge?

Diagrams and Digital Technologies?

Thought Styles (Ludwik Fleck)

I should like to mention two measures which are at the disposal of the scientific thought-style for giving the character of things to its creations.

  • One of them is technical terms, […].
  • Another measure is the scientific device […].

Whoever can look into the telescope and think of Saturn consequently uses a certain definite thought-style. There is no other possibility for him: he must recognize the ring of Saturn as a reality independent of himself, and his own thought-style as the only 'good' one.

Fleck: The Problem of Epistemology (1936); see Fix (2011)

Diagrams Express Thought Styles (Ludwik Fleck)

Language – Diagrams – Instruments

Diagrams are more than means to visualize. They are something between language and instruments in scientific processes and express thought styles (and their changes).

Code – Visualizations – Thought Styles

Language – Coding – Diagrams – Instruments

Programming, embedded in "coding cultures", influences visualizations and their interpretation.

Sometimes, Materialization is Necessary: Automatic Poems by "Goettherina"

Goettherina2

2019, Noah Bubenhofer

Goettherina

www.bubenhofer.com
http://www.bubenhofer.com/presentations/2020-02-19_Darmstadt/