Iconicity ratings

Iconicity ratings are a key tool in psycholinguistic studies of vocabulary. This figure shows the distribution of ratings for 14,000 English words in two ways: (a) A kernel density plot of the distribution of average ratings; the dashed line indicates a normal distribution with the same mean and standard deviation; (b) standard deviations across raters (y-axis) as a function of average rating (x-axis). Extreme values are rarer, but people agree more strongly on them. (Figure by first author Bodo Winter, open data here.)
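
The two quantities plotted in panel (b), the average rating per word and the standard deviation across raters, can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the words and rating values below are invented, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean, stdev

# Invented example data: word -> individual rater scores (values are assumptions).
ratings = {
    "buzz": [7, 6, 7, 7],
    "table": [2, 1, 4, 3],
    "murmur": [6, 5, 7, 4],
}

def summarize(ratings):
    """Return {word: (mean rating, sample SD across raters)}."""
    return {word: (mean(scores), stdev(scores)) for word, scores in ratings.items()}

summary = summarize(ratings)

# Print words ordered by mean rating, with the spread across raters.
for word, (avg, sd) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{word:8s} mean={avg:.2f} sd={sd:.2f}")
```

Plotting the SD against the mean for every word would give a scatter like panel (b); with real data, lower SDs at the extremes correspond to stronger agreement among raters.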

Winter, B., Lupyan, G., Perry, L. K., Dingemanse, M., & Perlman, M. (2023). Iconicity ratings for 14,000+ English words. Behavior Research Methods. doi: 10.3758/s13428-023-02112-6 PDF

Multimodal effort in repair sequences

Boxplots showing the joint amount of multimodal effort invested by both participants to resolve the interactional trouble. The boxes represent the interquartile range; the middle line the median; the whiskers the minimum and maximum scores (outliers excluded). Every dot represents a repair sequence, i.e., repair initiation and repair solution together. As the specificity of repair formats goes up, joint multimodal effort invested goes down.

Rasenberg, M., Pouw, W., Özyürek, A., & Dingemanse, M. (2022). The multimodal nature of communicative efficiency in social interaction. Scientific Reports, 12(1), 19111. doi: 10.1038/s41598-022-22883-w PDF

Cultural evolution of continuers

Continuers (frequent standalone utterances like mm-hm that people often use in succession) differ in interesting ways from other elements that are common, like top tokens (the most common words in a corpus) and discontinuers (frequent standalone utterances that people do not produce in successive streaks). A. Length of tokens for continuers, discontinuers and top tokens in 32 languages. B. Frequencies of major sound classes across types. Vowel nuclei occur across types, but continuers stand out for their preferences for nasals. C. Random forest analysis of 118 continuer forms in 32 spoken languages showing the top 10 most predictive phonemes (out of 29 attested).

Dingemanse, M., Liesenfeld, A., & Woensdregt, M. (2022). Convergent cultural evolution of continuers (mmhm). The Evolution of Language: Proceedings of the Joint Conference on Language Evolution (JCoLE), 61–67. PDF

Quality control for conversational corpora

Conversational data can be transcribed in many ways. This panel provides a quick way to gauge the quality of transcriptions, here illustrated with data from Ambel (Arnold, 2017). A. Distribution of the timing of dyadic turn-transitions with positive values representing gaps between turns and negative values representing overlaps. This kind of normal distribution centered around 0 ms is typical; when corpora starkly diverge from this it usually indicates noninteractive data, or segmentation methods that do not represent the actual timing of utterances. B. Distribution of transition time by duration, allowing the spotting of outliers and artefacts of automation (e.g. many turns of similar durations). C. A frequency/rank plot allows a quick sanity check of expected power law distributions and a look at the most frequent tokens in the corpus. D. Three randomly selected 10 second stretches of dyadic conversation give an impression of the timing and content of annotations in the corpus.
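
The transition measure in panel A can be illustrated with a small sketch: for each change of speaker, subtract the end of the previous turn from the start of the next one. The `Turn` record, the `transition_times` helper and the timestamps are assumptions made for illustration; the paper's own pipeline may define transitions differently.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    begin: int  # start time in ms
    end: int    # end time in ms

def transition_times(turns):
    """Offsets in ms at speaker changes: positive = gap, negative = overlap."""
    ordered = sorted(turns, key=lambda t: t.begin)
    return [
        nxt.begin - prev.end
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt.speaker != prev.speaker  # same-speaker continuations are skipped
    ]

# Invented dyadic example.
turns = [
    Turn("A", 0, 1200),
    Turn("B", 1350, 2000),  # starts 150 ms after A ends: a gap
    Turn("A", 1900, 3100),  # starts 100 ms before B ends: an overlap
    Turn("A", 3300, 3900),  # same speaker as previous turn: no transition
    Turn("B", 3900, 4500),  # starts exactly when A ends
]

print(transition_times(turns))  # [150, -100, 0]
```

A histogram of these offsets over a whole corpus gives the kind of distribution shown in panel A; a peak far from 0 ms would be a prompt to inspect how the turns were segmented.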

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192. https://aclanthology.org/2022.lrec-1.126 PDF

How ASR training data differs from real conversation

L: Distributions of durations of utterances and sentences (in ms) in corpora of informal conversation (blue) and CommonVoice ASR training sets (red) in Hungarian, Dutch, and Catalan. Modal duration and annotation content differ dramatically by data type: 496 ms (6 words, 27 characters) for conversational turns and 4642 ms (10 words, 58 characters) for ASR training items. R: Visualization of tokens that feature more prominently in conversational data (blue) and ASR training data (red) in Dutch. Source data: 80k randomly sampled items from the Corpus of Spoken Dutch (Taalunie, 2014) and the Common Voice corpus for automatic speech recognition in Dutch (Ardila et al., 2020), based on the Scaled F score metric, plotted using scattertext (Kessler, 2017).

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192. https://aclanthology.org/2022.lrec-1.126 PDF

Zipf in conversation

Frequency/rank distributions of tokenized items (‘words’) and recurring turn formats in conversational corpora with at least 20 such turn formats, representing 22 languages (8 phyla). Tokenized items (blue) show a linear frequency/rank relation in log/log space. Recurring turn formats (whether one-word ○ or multi-word +) appear to obey a similar frequency/rank distribution for the 20% of turns that occur >20 times (purple), tapering off towards lower frequencies and unique turns (grey). Fit fluctuates with corpus size and the parallelism of distributions is most apparent in larger corpora.
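
A frequency/rank relation that is linear in log/log space can be checked with a rough sketch like the one below. It counts items, ranks them by frequency and estimates the slope with ordinary least squares on the log-transformed values. This is a crude estimator chosen for brevity, not necessarily the fitting method used in the paper, and the helper names and toy data are assumptions.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent item."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in pairs]
    ys = [math.log10(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Toy data constructed so that frequency = 12 / rank (an idealised Zipfian case).
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

pairs = rank_frequency(tokens)
slope = loglog_slope(pairs)
print(pairs, round(slope, 3))
```

The same two functions can be applied to whole turns instead of tokens (for example, by passing a list of normalised turn strings) to compare the two distributions; a slope near -1 is the idealised Zipfian case.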

Dingemanse, M., & Liesenfeld, A. (2022). From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5614–5633. doi: 10.18653/v1/2022.acl-long.385 PDF

/r/ for rough in Indo-European

A. Across the Indo-European language family, the proportion of rough words with /r/ is much higher than the proportion of smooth words with /r/. B. Each dot represents a language (size of the circle = number of words); whiskers show 95% Bayesian credible intervals corresponding to the mixed-effects Bayesian logistic regression analysis indicating that rough words have a much higher proportion of /r/ (posterior mean = 63%) than smooth words (posterior mean = 35%).

Winter, B., Sóskuthy, M., Perlman, M., & Dingemanse, M. (2022). Trilled /r/ is associated with roughness, linking sound and touch across spoken languages. Scientific Reports, 12(1), 1035. doi: 10.1038/s41598-021-04311-7 PDF

Gesture kinematics

Setup of a study using motion tracking to investigate continuous properties of evolving manual signals. Panel a: Seed gestures for a fixed set of meanings are learned by next generations in an iterative learning experiment. Panel b: Using motion tracking, we derive automatic kinematic measures of entropy, temporal variability and intermittency over time and over generations.

Pouw, W., Dingemanse, M., Motamedi, Y., & Özyürek, A. (2021). A Systematic Investigation of Gesture Kinematics in Evolving Manual Languages in the Lab. Cognitive Science, 45(7), e13014. doi: 10.1111/cogs.13014 PDF

The iconicity boom

Proportional number of publications cataloged in Web of Science (1900–2017), showing concurrent upsurges in six topics related to iconicity (corrected for overall publication volume).

Nielsen, A. K. S., & Dingemanse, M. (2021). Iconicity in Word Learning and Beyond: A Critical Review. Language and Speech, 64(1), 52–72. doi: 10.1177/0023830920914339 PDF

Computational complexity of repair and pragmatic reasoning

A comparison of computational complexity (in basic computation steps) by agent type and lexicon size. The main take-away from this figure is that complexity increases exponentially with lexicon size for pragmatic agents, but only linearly for interactional agents. Three types of agent are compared, each with three lexicon sizes. The Interactional agent is a model equipped with a simple form of repair and no pragmatic reasoning. The other two agents cannot initiate repair, but instead feature pragmatic reasoning. The Frugally Pragmatic agent is a model that only uses complex pragmatic reasoning above a certain uncertainty threshold; the Fully Pragmatic agent always uses it. For interactional agents with a 6 × 4 lexicon no data is visible as the computation cost is very small (48) relative to the range of the y-axis.

Arkel, J. van, Woensdregt, M., Dingemanse, M., & Blokpoel, M. (2020). A simple repair mechanism can alleviate computational demands of pragmatic reasoning: simulations and complexity analysis. Proceedings of the 24th Conference on Computational Natural Language Learning. doi: 10.18653/v1/2020.conll-1.14 PDF

Vowel-colour associations

Vowel-colour associations for 1164 participants (central panel), showing, clockwise from bottom left: (a) a participant with very low structure yet high consistency across trials, probably a false positive synaesthete; (b) a typical nonsynaesthete with mappings that are both inconsistent and unstructured; (c) a middling participant with significant structure but inconsistent choices across trials; (d) a highly structured but inconsistent participant; and (e) a typical vowel-colour synaesthete, with highly structured, consistent and categorical mappings.

Cuskley, C., Dingemanse, M., Kirby, S., & van Leeuwen, T. M. (2019). Cross-modal associations and synesthesia: Categorical perception and structure in vowel–color mappings in a large online sample. Behavior Research Methods, 51(4), 1651–1675. doi: 10.3758/s13428-019-01203-7 PDF

Codability of sensory domains

The hierarchy of the senses across languages according to the mean codability of each domain, with the presumed universal Aristotelian hierarchy on top. There is no universal hierarchy of the senses across diverse languages worldwide. (Figure by coauthor Sean G. Roberts, open data here.)

Majid, A., Roberts, S. G., Cilissen, L., Emmorey, K., Nicodemus, B., O’Grady, L., Woll, B., LeLan, B., de Sousa, H., Cansler, B. L., Shayan, S., de Vos, C., Senft, G., Enfield, N. J., Razak, R. A., Fedden, S., Tufvesson, S., Dingemanse, M., Ozturk, O., … Levinson, S. C. (2018). Differential coding of perception in the world’s languages. Proceedings of the National Academy of Sciences, 115(45), 11369–11376. doi: 10.1073/pnas.1720419115 PDF

Probability of encountering repair

Interactive repair (when people work together to fix trouble in conversation) is quite common. In these 12 languages from around the world, it takes only 84 seconds on average between one repair sequence and the next. The sheer frequency shows how important repair is as a system that keeps conversation on track and helps us negotiate common understanding in a world full of noise. We are united in asking questions.

Dingemanse, M., Roberts, S. G., Baranova, J., Blythe, J., Drew, P., Floyd, S., Gisladottir, R. S., Kendrick, K. H., Levinson, S. C., Manrique, E., Rossi, G., & Enfield, N. J. (2015). Universal Principles in the Repair of Communication Problems. PLOS ONE, 10(9), e0136100. doi: 10.1371/journal.pone.0136100

Cultural evolution of continuous signals

The cultural evolution of continuous signals over 4 generations in a single experimental chain of iterated communication. Colour represents communicative success. Through trial and error, participants in consecutive trials narrow down to a set of signals that is both iconic (in mirroring aspects of form) and systematic (in using slope direction to signal the way animals are facing). This represents in miniature form how iconicity can provide the building blocks for systematicity in linguistic systems.

Dingemanse, M., Verhoef, T., & Roberts, S. G. (2014). The role of iconicity in the cultural evolution of communicative signals. In B. de Boer & T. Verhoef (Eds.), Proceedings of Evolang X Workshop on Signals, Speech and Signs (pp. 11–15). PDF