How conversational data challenges speech recognition (ASR)

A Word error rates (WER) for five speech-to-text systems in six languages. B One minute of English conversation as annotated by human transcribers (top) and by five speech-to-text systems, showing that while most do some diarization, all underestimate the number of transitions and none represent overlapping turns (Whisper offers no diarization). C Speaker transitions and distribution of floor transfer offset times (all languages), showing that even ASR systems that support diarization do not represent overlapping annotations in their output.

Liesenfeld, A., Lopez, A., & Dingemanse, M. (2023). The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems. Proceedings of the 24th Annual SIGdial Meeting on Discourse and Dialogue. doi: 10.1145/3571884.3604316 PDF

Quality control for conversational corpora

Conversational data can be transcribed in many ways. This panel provides a quick way to gauge the quality of transcriptions, here illustrated with data from Ambel (Arnold, 2017). A. Distribution of the timing of dyadic turn-transitions with positive values representing gaps between turns and negative values representing overlaps.
This kind of normal distribution centered around 0 ms is typical; when corpora starkly diverge from this it usually indicates noninteractive data, or segmentation methods that do not represent the actual timing of utterances. B. Distribution of transition time by duration, allowing the spotting of outliers and artefacts of automation (e.g. many turns of similar durations). C. A frequency/rank plot allows a quick sanity check of expected power law distributions and a look at the most frequent tokens in the corpus. D. Three randomly selected 10 second stretches of dyadic conversation give an impression of the timing and content of annotations in the corpus.

Liesenfeld, A., & Dingemanse, M. (2022). Building and curating conversational corpora for diversity-aware language science and technology. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–1192. https://aclanthology.org/2022.lrec-1.126 PDF

Timing of yes/no sequences

Assessing the timing of turn-taking requires careful operationalisation. The largest comparative study so far (Stivers et al., 2009) looked at polar questions and their answers in order to have a directly comparable sequential context.

In our paper on conversational corpora, we use this same sequential context, and compare it to the larger set of dyadic speaker transitions in interaction. Given the broad-scale comparability of the overall timing distributions (in grey) and the more controlled subset of at least 250 question-answer sequences per language (in black), we conclude that QA sequences can act as a useful proxy for timing in general (supporting Stivers et al. 2009), but also that QA-sequences are not necessary for a relatively robust impression of overall timing.

Dingemanse, M., & Liesenfeld, A. (2022). From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5614–5633. doi: 10.18653/v1/2022.acl-long.385 PDF