How conversational data challenges speech recognition (ASR)

A. Word error rates (WER) for five speech-to-text systems in six languages. B. One minute of English conversation as annotated by human transcribers (top) and by five speech-to-text systems, showing that while most do some diarization, all underestimate the number of transitions and none represent overlapping turns (Whisper offers no diarization). C. Speaker transitions and distribution of floor transfer offset times (all languages), showing that even ASR systems that support diarization do not represent overlapping annotations in their output.
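For readers unfamiliar with the metric in panel A, word error rate is conventionally the word-level edit distance between a reference transcript and a system's output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name and the whitespace tokenisation are assumptions made for illustration, and the paper's own scoring pipeline (text normalisation, handling of fillers and overlap) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokens are split on whitespace, with no
    normalisation of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A system that drops a short turn-initial token makes one deletion
# against a five-word reference.
wer_dropped = word_error_rate("oh yeah that is right", "yeah that is right")
```

Because the denominator is the reference length, short conversational turns are penalised heavily for a single missed word, which is one reason conversational data can be hard on this metric.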

Liesenfeld, A., Lopez, A., & Dingemanse, M. (2023). The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems. Proceedings of the 24th Annual SIGdial Meeting on Discourse and Dialogue, 482–495. doi: 10.18653/v1/2023.sigdial-1.45

Iconicity ratings

Iconicity ratings are a key tool in psycholinguistic studies of vocabulary. This figure shows the distribution of ratings for 14,000 English words in two ways: (a) A kernel density plot of the distribution of average ratings; the dashed line indicates a normal distribution with the same mean and standard deviation; (b) standard deviations across raters (y-axis) as a function of average rating (x-axis). Extreme values are rarer, but people agree more strongly on them. (Figure by first author Bodo Winter, open data here.)
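Panel (b) relates two per-word summaries: the average rating and the standard deviation across raters. A minimal sketch of how such summaries could be computed is given below. The words, the rating values, and the helper name are hypothetical and are not taken from the published dataset.

```python
from statistics import mean, stdev

# Hypothetical ratings from four raters per word (illustrative values only).
ratings_by_word = {
    "splash": [6, 7, 6, 7],  # raters largely agree on a high rating
    "idea": [2, 5, 3, 6],    # raters disagree around a mid-scale rating
}


def summarize(ratings):
    """Return {word: (mean rating, sample standard deviation across raters)}."""
    return {word: (mean(values), stdev(values)) for word, values in ratings.items()}


summary = summarize(ratings_by_word)
```

In this toy example the word with the more extreme mean also has the smaller standard deviation, which is the pattern the caption describes for the real data.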

Winter, B., Lupyan, G., Perry, L. K., Dingemanse, M., & Perlman, M. (2023). Iconicity ratings for 14,000+ English words. Behavior Research Methods. doi: 10.3758/s13428-023-02112-6

Iconicity measures across tasks

Discriminability of iconicity measures from different tasks. Iconicity ratings have been transformed so that they vary between 0 and 1 (to compare with guessing accuracies). Guesses (where people try to guess the meaning of an iconic word, or the word form belonging to a given meaning) appear to be somewhat more evenly spread than ratings. Iconicity ratings by native speakers (rightmost, showing data from Thompson et al. 2020) are on average higher than iconicity ratings by people who don’t speak the language whose words they rate, confirming the notion that native speakers will generally feel that words of their own language are more iconic. (Figure by Bonnie McLean, open data here.)
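The caption does not say which transformation was used to bring ratings into the 0 to 1 range. One common choice is min-max rescaling against the endpoints of the rating scale; the sketch below assumes that approach, and the function name and the 1 to 7 scale in the example are illustrative assumptions rather than details from the paper.

```python
def rescale_to_unit(values, scale_min, scale_max):
    """Min-max rescale ratings so the scale endpoints map to 0 and 1.

    Assumes a bounded rating scale; this is one possible transformation,
    not necessarily the one used for the figure.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    span = scale_max - scale_min
    return [(value - scale_min) / span for value in values]


# Ratings on a hypothetical 1-7 scale become directly comparable
# with guessing accuracies, which already lie between 0 and 1.
rescaled = rescale_to_unit([1, 4, 7], scale_min=1, scale_max=7)
```

Rescaling against the scale endpoints, rather than the observed minimum and maximum, keeps values comparable across datasets that use the same scale.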

McLean, B., Dunn, M., & Dingemanse, M. (2023). Two measures are better than one: combining iconicity ratings and guessing experiments for a more nuanced picture of iconicity in the lexicon. Language and Cognition, 15(4), 719–739. doi: 10.1017/langcog.2023.9

Quality control for conversational corpora

Conversational data can be transcribed in many ways. This panel provides a quick way to gauge the quality of transcriptions, here illustrated with data from Ambel (Arnold, 2017). A. Distribution of the timing of dyadic turn-transitions, with positive values representing gaps between turns and negative values representing overlaps. This kind of normal distribution centered around 0 ms is typical; when corpora starkly diverge from this, it usually indicates noninteractive data or segmentation methods that do not represent the actual timing of utterances. B. Distribution of transition time by duration, which helps spot outliers and artefacts of automation (e.g. many turns of similar durations). C. A frequency/rank plot allows a quick sanity check of expected power law distributions and a look at the most frequent tokens in the corpus. D. Three randomly selected 10-second stretches of dyadic conversation give an impression of the timing and content of annotations in the corpus.
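The transition timing in panel A can be sketched as follows, under a simplified operationalisation: for each pair of consecutive turns by different speakers, subtract the end time of the earlier turn from the start time of the later one. The `Turn` type, the function name, and the example timings are hypothetical, and the paper's actual procedure for selecting dyadic transitions may be more restrictive.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    speaker: str
    start_ms: int
    end_ms: int


def floor_transfer_offsets(turns: List[Turn]) -> List[int]:
    """Offsets in ms at speaker changes: positive = gap, negative = overlap."""
    ordered = sorted(turns, key=lambda turn: turn.start_ms)
    offsets = []
    for previous, following in zip(ordered, ordered[1:]):
        # Only transitions between different speakers count;
        # same-speaker continuations are skipped.
        if previous.speaker != following.speaker:
            offsets.append(following.start_ms - previous.end_ms)
    return offsets


# Hypothetical annotations: one gap, one overlap, one same-speaker continuation.
example_turns = [
    Turn("A", 0, 1200),
    Turn("B", 1350, 2000),  # starts 150 ms after A ends (gap)
    Turn("A", 1900, 3000),  # starts 100 ms before B ends (overlap)
    Turn("A", 3200, 3500),  # same speaker, not a transition
]
offsets = floor_transfer_offsets(example_turns)
```

A transcription or ASR output that never produces negative values here cannot represent overlap, which is the kind of divergence the quality-control panel is meant to expose.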

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192.

How ASR training data differs from real conversation

L: Distributions of durations of utterances and sentences (in ms) in corpora of informal conversation (blue) and CommonVoice ASR training sets (red) in Hungarian, Dutch, and Catalan. Modal duration and annotation content differ dramatically by data type: 496 ms (6 words, 27 characters) for conversational turns and 4642 ms (10 words, 58 characters) for ASR training items. R: Visualization of tokens that feature more prominently in conversational data (blue) and ASR training data (red) in Dutch. Source data: 80k randomly sampled items from the Corpus of Spoken Dutch (Taalunie, 2014) and the Common Voice corpus for automatic speech recognition in Dutch (Ardila et al., 2020), compared using the Scaled F score metric and plotted using scattertext (Kessler, 2017).

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192.

Timing of yes/no sequences

Assessing the timing of turn-taking requires careful operationalisation. The largest comparative study so far (Stivers et al., 2009) looked at polar questions and their answers in order to have a directly comparable sequential context.

In our paper on conversational corpora, we use this same sequential context and compare it to the larger set of dyadic speaker transitions in interaction. Given the broad-scale comparability of the overall timing distributions (in grey) and the more controlled subset of at least 250 question-answer sequences per language (in black), we conclude that QA sequences can act as a useful proxy for timing in general (supporting Stivers et al. 2009), but also that QA sequences are not necessary for a relatively robust impression of overall timing.

Dingemanse, M., & Liesenfeld, A. (2022). From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5614–5633. doi: 10.18653/v1/2022.acl-long.385

Iconicity and funniness ratings

The intersection of iconicity and funniness ratings for 1419 words. A: Scatterplot of iconicity and funniness ratings in which each dot corresponds to a word. A loess function generates the smoothed conditional mean with a 0.95 confidence interval. B and C: Distributions of iconicity and funniness ratings in this dataset.

Dingemanse, M., & Thompson, B. (2020). Playful iconicity: structural markedness underlies the relation between funniness and iconicity. Language and Cognition, 12(1), 203–224. doi: 10.1017/langcog.2019.49