Simulating phonetic evolution

Plots of where in a phonetic possibility space different words end up after 10,000 rounds of interaction, across 20 independent simulation runs (each cloud of 100 exemplar dots/triangles represents a single word at round 10,000 of a single simulation run). Blue, yellow, green and orange are regular words; purple is the continuer word. On each independent simulation run, all words are initialised at randomly selected positions in the space. A shows a selection of 6 separate simulation runs chosen for illustrative purposes (showing how regular words end up in different positions); B shows the end-state of all 20 simulation runs overlaid. Parameter settings: (i) the minimal effort bias is 3 times as strong for the continuer word (G=1250) as for regular vocabulary words (G=5000), and (ii) the bias for reuse of features (i.e. segment-similarity bias) is not applied to the continuer category.

Dingemanse, M., Liesenfeld, A., & Woensdregt, M. (2022). Convergent cultural evolution of continuers (mmhm). The Evolution of Language: Proceedings of the Joint Conference on Language Evolution (JCoLE), 61–67.

Sequential context of continuers

A Candidate continuer forms in 10 unrelated languages, B shown in their natural sequential ecology (annotations as in the original data), C with spectrograms and pitch traces of representative tokens made using the Parselmouth interface to Praat (Jadoul et al., 2018; Boersma & Weenink, 2013).

Dingemanse, M., Liesenfeld, A., & Woensdregt, M. (2022). Convergent cultural evolution of continuers (mmhm). The Evolution of Language: Proceedings of the Joint Conference on Language Evolution (JCoLE), 61–67.

Sampling response tokens

A. Overview of included languages with dataset size in hours and top 3 sequentially identified response tokens as transcribed in the corpus. B. Location of largest speech community. C. Assessing the impact of sparse data on UMAP projections using three samples of Dutch response tokens. A look at the full dataset (a) and random-sampled subsets of decreasing size (b, c) suggests isomorphism across scales and interpretability of clustering solutions as small as 150 tokens.

Liesenfeld, A., & Dingemanse, M. (2022). Bottom-up discovery of structure and variation in response tokens (‘backchannels’) across diverse languages. Proceedings of Interspeech 2022, 1126–1130. https://doi.org/10.21437/Interspeech.2022-11288

Cultural evolution of continuers

Continuers (frequent standalone utterances like mm-hm that people often use in succession) differ in interesting ways from other elements that are common, like top tokens (the most common words in a corpus) and discontinuers (frequent standalone utterances that people do not produce in successive streaks). A. Length of tokens for continuers, discontinuers and top tokens in 32 languages. B. Frequencies of major sound classes across types. Vowel nuclei occur across types, but continuers stand out for their preferences for nasals. C. Random forest analysis of 118 continuer forms in 32 spoken languages showing the top 10 most predictive phonemes (out of 29 attested).

Dingemanse, M., Liesenfeld, A., & Woensdregt, M. (2022). Convergent cultural evolution of continuers (mmhm). The Evolution of Language: Proceedings of the Joint Conference on Language Evolution (JCoLE), 61–67.

Quality control for conversational corpora

Conversational data can be transcribed in many ways. This panel provides a quick way to gauge the quality of transcriptions, here illustrated with data from Ambel (Arnold, 2017). A. Distribution of the timing of dyadic turn-transitions with positive values representing gaps between turns and negative values representing overlaps. This kind of normal distribution centered around 0 ms is typical; when corpora starkly diverge from this it usually indicates noninteractive data, or segmentation methods that do not represent the actual timing of utterances. B. Distribution of transition time by duration, allowing the spotting of outliers and artefacts of automation (e.g. many turns of similar durations). C. A frequency/rank plot allows a quick sanity check of expected power law distributions and a look at the most frequent tokens in the corpus. D. Three randomly selected 10-second stretches of dyadic conversation give an impression of the timing and content of annotations in the corpus.

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192. https://aclanthology.org/2022.lrec-1.126
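
The turn-transition timing in panel A can be illustrated with a minimal sketch. It assumes each annotation is a speaker label with start and end times in milliseconds; the `Turn` class, the `transition_offsets` function and the toy data are hypothetical names made up for this illustration and are not taken from the paper's own pipeline. Positive values are gaps, negative values are overlaps.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One annotated turn (hypothetical structure for illustration)."""
    speaker: str
    start_ms: int
    end_ms: int


def transition_offsets(turns):
    """Return the timing of each speaker change in ms.

    The offset is the start of the next turn minus the end of the
    previous turn: positive = gap, negative = overlap. Consecutive
    turns by the same speaker are not transitions and are skipped.
    """
    ordered = sorted(turns, key=lambda t: t.start_ms)
    offsets = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker != nxt.speaker:
            offsets.append(nxt.start_ms - prev.end_ms)
    return offsets


# Toy dyadic conversation (invented data).
turns = [
    Turn("A", 0, 1200),
    Turn("B", 1350, 2000),  # starts 150 ms after A ends: a gap
    Turn("A", 1900, 3000),  # starts 100 ms before B ends: an overlap
    Turn("A", 3200, 4000),  # same speaker continues: no transition
    Turn("B", 4000, 4500),  # starts exactly when A ends
]

print(transition_offsets(turns))  # [150, -100, 0]
```

A histogram of these offsets over a whole corpus gives the kind of distribution shown in panel A; a distribution far from 0 ms would be a reason to inspect how the corpus was segmented.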

How ASR training data differs from real conversation

L: Distributions of durations of utterances and sentences (in ms) in corpora of informal conversation (blue) and CommonVoice ASR training sets (red) in Hungarian, Dutch, and Catalan. Modal duration and annotation content differ dramatically by data type: 496ms (6 words, 27 characters) for conversational turns and 4642ms (10 words, 58 characters) for ASR training items. R: Visualization of tokens that feature more prominently in conversational data (blue) and ASR training data (red) in Dutch. Source data: 80k randomly sampled items from the Corpus of Spoken Dutch (Taalunie, 2014) and the Common Voice corpus for automatic speech recognition in Dutch (Ardila et al., 2020); token prominence is based on the Scaled F-Score metric and plotted using scattertext (Kessler, 2017).

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192. https://aclanthology.org/2022.lrec-1.126

Mhmm over time

Even apparently universal patterns (like the use of ‘mhm’ during tellings) can show important cross-cultural differences. A. Continuers (marked ○) are among the most frequent recipient behaviours in both English and Korean, shown here in four 80-second stretches of tellings. B. However, the relative frequency of continuers is about twice as high in Korean based on 100 random samples of 80-second segments in both languages: on average, 21% of turns are continuers in Korean, against 9% of turns in English (measures expressed this way to control for speech rate differences).

Dingemanse, M., & Liesenfeld, A. (2022). From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5614–5633. https://doi.org/10.18653/v1/2022.acl-long.385
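
The measure in panel B (share of turns that are continuers, averaged over random 80-second segments) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: turns are assumed to be `(start_seconds, token)` pairs, and the `CONTINUERS` set, the function names and the toy data are hypothetical. In the paper, continuers are identified from the data rather than from a fixed list.

```python
import random

# Hypothetical continuer forms, for illustration only.
CONTINUERS = frozenset({"mhm", "mm", "uhuh"})


def continuer_share(turns, window_start, window_len=80.0):
    """Proportion of turns starting inside the window that are continuers.

    Returns None when no turn starts inside the window.
    """
    window_end = window_start + window_len
    tokens = [tok for start, tok in turns if window_start <= start < window_end]
    if not tokens:
        return None
    return sum(tok in CONTINUERS for tok in tokens) / len(tokens)


def mean_continuer_share(turns, total_len, n_samples=100, window_len=80.0, seed=0):
    """Average continuer share over randomly placed windows.

    Assumes total_len >= window_len. Empty windows are ignored.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shares = []
    for _ in range(n_samples):
        start = rng.uniform(0.0, total_len - window_len)
        share = continuer_share(turns, start, window_len)
        if share is not None:
            shares.append(share)
    return sum(shares) / len(shares) if shares else None


# Toy telling (invented data): start time in seconds, transcribed token.
turns = [(1.0, "so"), (5.0, "mhm"), (20.0, "then"), (30.0, "mhm"), (90.0, "yeah")]

print(continuer_share(turns, 0.0))   # 0.5: two of the four turns are continuers
print(continuer_share(turns, 85.0))  # 0.0: the only turn in the window is "yeah"
```

Expressing the result as a proportion of turns rather than a raw count per segment is what makes the comparison robust to differences in speech rate between languages.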

/r/ for rough in Indo-European

A. Across the Indo-European language family, the proportion of rough words with /r/ is much higher than the proportion of smooth words with /r/. B. Each dot represents a language (size of the circle = number of words); whiskers show 95% Bayesian credible intervals from the mixed-effects Bayesian logistic regression analysis, which indicates that rough words have a much higher proportion of /r/ (posterior mean = 63%) than smooth words (posterior mean = 35%).

Winter, B., Sóskuthy, M., Perlman, M., & Dingemanse, M. (2022). Trilled /r/ is associated with roughness, linking sound and touch across spoken languages. Scientific Reports, 12(1), 1035. https://doi.org/10.1038/s41598-021-04311-7

Iconicity and funniness ratings

The intersection of iconicity and funniness ratings for 1419 words. A: Scatterplot of iconicity and funniness ratings in which each dot corresponds to a word. A loess function generates the smoothed conditional mean with 0.95 confidence interval. Panels B and C show the distribution of iconicity and funniness ratings in this dataset.

Dingemanse, M., & Thompson, B. (2020). Playful iconicity: structural markedness underlies the relation between funniness and iconicity. Language and Cognition, 12(1), 203–224. https://doi.org/10.1017/langcog.2019.49

Vowel-colour associations

Vowel-colour associations for 1164 participants (central panel), showing, clockwise from bottom left, (a) a participant with very low structure yet high consistency across trials, probably a false positive synaesthete; (b) a typical nonsynaesthete with mappings that are both inconsistent and unstructured; (c) a middling participant with significant structure but inconsistent choices across trials; (d) a highly structured but inconsistent participant; and (e) a typical vowel-colour synaesthete, with highly structured, consistent and categorical mappings.

Cuskley, C., Dingemanse, M., Kirby, S., & van Leeuwen, T. M. (2019). Cross-modal associations and synesthesia: Categorical perception and structure in vowel–color mappings in a large online sample. Behavior Research Methods, 51(4), 1651–1675. https://doi.org/10.3758/s13428-019-01203-7

Arbitrariness, iconicity and systematicity

(A, B) Words show arbitrariness when there are conventional associations between forms and meanings. Words show iconicity when there are perceptuomotor analogies between forms and meanings, here indicated by shape, size and proximity (inset). (B, C) Words show systematicity when statistical regularities in phonological form, here indicated by color, serve as cues to abstract categories such as word classes. (D) The cues involved in systematicity differ across languages and may be arbitrary. (E) The perceptual analogies involved in iconicity transcend languages and may be universal.

Dingemanse, M., Blasi, D. E., Lupyan, G., Christiansen, M. H., & Monaghan, P. (2015). Arbitrariness, iconicity and systematicity in language. Trends in Cognitive Sciences, 19(10), 603–615. https://doi.org/10.1016/j.tics.2015.07.013