The occurrence of interjections in 10-min excerpts of informal dyadic conversations in six spoken languages. Every panel shows the turns of a dyadic exchange; colored dots indicate turns that belong to the top 10 most common one-word standalone turn formats in the language. These excerpts cannot support strong comparative or typological inferences; they are only meant to give an impression of the prevalence of interjections across unrelated languages.
A With interactive repair, another participant initiates repair, inviting a repair solution by the first; the repair initiation is a pivot, pointing both back and forward. B While a fitted response is preferred, initiating repair is always a possible next move; likewise, within repair, while a restricted format is preferred, an open format is always an option. C Across diverse languages, formats for interactive repair fall into three types, depending on how they target the trouble in the prior turn and the kind of response they typically invite; these can be ranked from less to more specific in terms of the grasp of the trouble source they display. D Empirical cumulative distribution of independent repair sequences (black curve) as they occur over time in informal conversation in a global sample of 12 languages (grey curves). Across languages, the steepest part of the slope is around 17 s, the average is 84 s, and nearly all sequences occur within a 4-min window from the last.
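The curve in panel D can be approximated from the onset times of repair initiations. The following is a minimal sketch, assuming onsets are available in seconds; the function names and the example onsets are hypothetical and are not data or code from the study.

```python
from bisect import bisect_right


def inter_repair_intervals(onsets_s):
    """Gaps (in seconds) between consecutive repair initiations."""
    ordered = sorted(onsets_s)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def ecdf(values):
    """Return a function F where F(t) is the share of values <= t."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")

    def F(t):
        # bisect_right counts how many sorted values are <= t
        return bisect_right(ordered, t) / n

    return F


# Hypothetical onsets (seconds into a recording), for illustration only.
onsets = [12.0, 30.0, 95.0, 110.0, 340.0]
gaps = inter_repair_intervals(onsets)  # [18.0, 65.0, 15.0, 230.0]
F = ecdf(gaps)
mean_gap = sum(gaps) / len(gaps)       # 82.0
share_within_4_min = F(240)            # 1.0
```

Evaluating `F` on a grid of time points gives the cumulative curve; the mean of the gaps corresponds to the kind of average reported in the caption.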
A Word error rates (WER) for five speech-to-text systems in six languages. B One minute of English conversation as annotated by human transcribers (top) and by five speech-to-text systems, showing that while most do some diarization, all underestimate the number of transitions and none represent overlapping turns (Whisper offers no diarization). C Speaker transitions and distribution of floor transfer offset times (all languages), showing that even ASR systems that support diarization do not represent overlapping annotations in their output.
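For readers unfamiliar with the measure in panel A, word error rate is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the caption does not say how the reported figures were computed, and the example strings are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the system output drops two short interjections.
human = "oh really yeah I think so"
system = "really I think so"
wer = word_error_rate(human, system)  # 2 deletions / 6 words
```

Dropped one-word turns count as deletions, so a system that skips interjections pays for each of them in this measure.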
How different speech recognition engines warp dialog act classification in the same dataset of conversational English. For 8 frequent dialog acts, coloured lines show how dialog acts based on ASR output deviate from those based on human transcripts of the same data (baseline). Dot size scales with the number of times a tag is assigned. Only the most frequently assigned dialog acts (with at least 25 tokens in at least one dataset) are shown here. Mean absolute percentage deviations by ASR system: nemo 27.8%, amazon 31.4%, whisper 33.8%, rev 47.4%.
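One plausible reading of the summary statistic is sketched below: for each dialog act, take the absolute difference between the ASR-based count and the human-transcript count as a percentage of the latter, then average over dialog acts. This is an assumption about the computation, not the documented method, and the tag names and counts are invented.

```python
def mean_abs_pct_deviation(baseline, observed):
    """Mean over tags of |observed - baseline| as a percentage of baseline.

    Tags with a zero baseline count are skipped, since a percentage
    deviation is undefined for them.
    """
    deviations = []
    for tag, base_count in baseline.items():
        if base_count == 0:
            continue
        obs_count = observed.get(tag, 0)
        deviations.append(abs(obs_count - base_count) * 100 / base_count)
    if not deviations:
        raise ValueError("no tags with a non-zero baseline count")
    return sum(deviations) / len(deviations)


# Invented counts, for illustration only.
human_counts = {"statement": 200, "backchannel": 100, "question": 50}
asr_counts = {"statement": 220, "backchannel": 60, "question": 45}
mapd = mean_abs_pct_deviation(human_counts, asr_counts)  # (10 + 40 + 10) / 3
```

Under this reading every dialog act weighs equally, so a large relative shift in a less frequent tag moves the score as much as the same shift in a frequent one.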
Most NLP methods and models focus on text rather than talk. What are they missing? Scattertext plot of words and phrases characteristic of spoken interaction (green) versus written text (purple) in English, with words most characteristic of conversational interaction in the upper left (and shown in a separate inset on the right). High-frequency metacommunicative interjections like uhhuh, hm, wow, um are most typical of talk, and most often underrepresented in text.
Assessing the timing of turn-taking requires careful operationalisation. The largest comparative study so far (Stivers et al., 2009) looked at polar questions and their answers in order to have a directly comparable sequential context.
In our paper on conversational corpora, we use this same sequential context, and compare it to the larger set of dyadic speaker transitions in interaction. Given the broad-scale comparability of the overall timing distributions (in grey) and the more controlled subset of at least 250 question-answer sequences per language (in black), we conclude that QA sequences can act as a useful proxy for timing in general (supporting Stivers et al. 2009), but also that QA sequences are not necessary for a relatively robust impression of overall timing.
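Timing at speaker transitions is commonly measured as the floor transfer offset: the start of the next speaker's turn minus the end of the prior speaker's turn, so that positive values are gaps and negative values are overlaps. The sketch below is a simplified illustration that pairs turns by start time; the `Turn` type and the example timings are hypothetical, and real corpora need more careful handling of overlapping and nested turns.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    start_ms: int
    end_ms: int


def floor_transfer_offsets(turns):
    """Offsets (ms) at speaker changes: next start minus previous end.

    Positive values are gaps, negative values are overlaps. Consecutive
    turns by the same speaker are not transitions and are skipped.
    """
    ordered = sorted(turns, key=lambda t: t.start_ms)
    offsets = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker != nxt.speaker:
            offsets.append(nxt.start_ms - prev.end_ms)
    return offsets


# Hypothetical dyadic exchange, for illustration only.
turns = [
    Turn("A", 0, 1200),
    Turn("B", 1350, 2000),   # starts 150 ms after A ends (gap)
    Turn("B", 2100, 2500),   # same speaker continues, no transition
    Turn("A", 2400, 3000),   # starts 100 ms before B ends (overlap)
]
offsets = floor_transfer_offsets(turns)  # [150, -100]
```

Restricting `turns` to question-answer pairs before calling the function would give the more controlled subset; running it on all turns gives the overall distribution.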
Interactional challenges to be negotiated in recruitment sequences, along with some of the interactional practices mobilized to address them.
Not a figure, I know, but sometimes tables are the only way to bring a multidimensional problem space into view. In this case, the table is also a map to the resources discussed in the paper.
Bella is holding Aku’s phone and taking a call Aku asked her to pick up. Speaking into the phone, she notes she is ‘not sister Aku’. When it becomes clear the caller wants Aku, Aku asks Bella to give the phone back, adding a gesture of reaching out to receive the phone. These kinds of events (‘recruitments’) are frequent in everyday interaction and show how people weave together talk and action to get others to do things.
There are a myriad ways to refer to places, but one useful way to think about their affordances in interaction is in terms of a distinction between locations and settings. Locations tell you where something is; settings invoke activities and actors. Many place references usefully combine the two: setting a story in the graveyard area not only localizes it for the audience in the know, but also provides a setting for ominous encounters.
Interactive repair, when people work together to fix trouble in conversation, is quite common. In these 12 languages from around the world, it takes only 84 seconds on average between one repair sequence and the next. The sheer frequency shows how important repair is as a system that keeps conversation on track and helps us negotiate common understanding in a world full of noise. We are united in asking questions.
Herb Clark, building on Austin’s (1962) distinctions of levels of speech acts, notes that successful communication is grounded in joint actions by speaker and addressee on at least four distinct levels. In the Austin/Clark action ladder, higher levels depend on lower levels in terms of causality (higher levels are implemented by means of lower ones) and entailment (completion of a higher level entails completion of the ones below it). As a corollary, the action ladder exhibits the property of “downward evidence”: evidence that B recognized A’s intended action (level 4) is also evidence that B succeeded in interpreting A’s words (level 3), that B correctly identified the words (level 2), and that B attended to A’s vocalisation (level 1). All four levels are involved in building mutual understanding, and each of them can be a locus of trouble.
The cultural evolution of continuous signals over 4 generations in a single experimental chain of iterated communication. Colour represents communicative success. Through trial and error, participants in consecutive trials narrow down to a set of signals that is both iconic (in mirroring aspects of form) and systematic (in using slope direction to signal the way animals are facing). This represents in miniature form how iconicity can provide the building blocks for systematicity in linguistic systems.