Iconicity ratings are a key tool in psycholinguistic studies of vocabulary. This figure shows the distribution of ratings for 14,000 English words in two ways: (a) A kernel density plot of the distribution of average ratings; the dashed line indicates a normal distribution with the same mean and standard deviation; (b) standard deviations across raters (y-axis) as a function of average rating (x-axis). Extreme values are rarer, but people agree more strongly on them. (Figure by first author Bodo Winter, open data here.)
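As a rough illustration of panel (a), the sketch below computes a Gaussian kernel density estimate and a normal curve with the same mean and standard deviation on a shared grid. It is not the code behind the figure: the ratings are randomly generated stand-ins, the 1 to 7 scale is an assumption, and the bandwidth uses a common normal-reference rule of thumb.

```python
import math
import random
import statistics


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Hypothetical stand-in data: 500 per-word mean ratings on an assumed 1-7 scale.
random.seed(1)
ratings = [min(7.0, max(1.0, random.gauss(3.5, 1.0))) for _ in range(500)]

mean = statistics.fmean(ratings)
sd = statistics.stdev(ratings)

# Normal-reference rule of thumb for the bandwidth (an assumption, not the figure's setting).
bandwidth = 1.06 * sd * len(ratings) ** (-1 / 5)

kde = gaussian_kde(ratings, bandwidth)
grid = [1 + 6 * i / 60 for i in range(61)]                  # 61 points from 1 to 7
kde_curve = [kde(x) for x in grid]                          # solid line
normal_curve = [normal_pdf(x, mean, sd) for x in grid]      # dashed reference line
```

Plotting `kde_curve` against `normal_curve` shows where the empirical distribution departs from normality, which is the comparison the dashed line in the figure supports.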
Response tokens like English mhmm, uhuhh, yeah or Catalan mm, sí, vale are tricky to study in the wild: their phonetic realizations can be quite different from how they are transcribed. Here we use UMAP, a method for dimensionality reduction used in bioacoustics and other fields, to explore the shape of inventories of response tokens in 16 languages. Every point represents a single response token; the closer two points are, the more similar they are acoustically. Spectrograms drawn around the rim of the plots provide a direct view of the acoustic structure of tokens and enable quick sanity checks.
Left: Distributions of durations of utterances and sentences (in ms) in corpora of informal conversation (blue) and CommonVoice ASR training sets (red) in Hungarian, Dutch, and Catalan. Modal duration and annotation content differ dramatically by data type: 496 ms (6 words, 27 characters) for conversational turns and 4642 ms (10 words, 58 characters) for ASR training items. Right: Visualization of tokens that feature more prominently in conversational data (blue) and ASR training data (red) in Dutch. Source data: 80k randomly sampled items from the Corpus of Spoken Dutch (Taalunie, 2014) and the Common Voice corpus for automatic speech recognition in Dutch (Ardila et al., 2020). Tokens are ranked by the Scaled F-score metric and plotted using scattertext (Kessler, 2017).
Most NLP methods and models focus on text rather than talk. What are they missing? Scattertext plot of words and phrases characteristic of spoken interaction (green) versus written text (purple) in English, with words most characteristic of conversational interaction in the upper left (and shown in a separate inset on the right). High-frequency metacommunicative interjections like uhhuh, hm, wow, um are most typical of talk, and most often underrepresented in text.
A: Across the Indo-European language family, the proportion of rough words with /r/ is much higher than the proportion of smooth words with /r/. B: Each dot represents a language (size of the circle = number of words); whiskers show 95% Bayesian credible intervals from the mixed-effects Bayesian logistic regression analysis, which indicates that rough words have a much higher proportion of /r/ (posterior mean = 63%) than smooth words (posterior mean = 35%).
Proportional number of publications cataloged in Web of Science (1900–2017), showing concurrent upsurges in six topics related to iconicity (corrected for overall publication volume).
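One simple way to correct for overall publication volume is to express each year's topic count as a rate per 10,000 indexed publications. The sketch below shows that arithmetic only; the function name and all counts are hypothetical, and the exact normalization used for the figure may differ.

```python
def proportional_counts(topic_counts, total_counts, per=10_000):
    """Topic publications per `per` indexed publications, by year."""
    return {
        year: per * topic_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0
    }


# Hypothetical counts, for illustration only (not Web of Science data).
topic = {1990: 5, 2000: 20, 2010: 90}
total = {1990: 500_000, 2000: 1_000_000, 2010: 1_500_000}

rates = proportional_counts(topic, total)
```

With these made-up numbers the raw count grows 18-fold between 1990 and 2010 while the rate grows 6-fold, which is why an upsurge is more convincing when shown proportionally.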
The intersection of iconicity and funniness ratings for 1419 words. A: Scatterplot of iconicity and funniness ratings in which each dot corresponds to a word. A loess function generates the smoothed conditional mean with a 0.95 confidence interval. Panels B and C show the distribution of iconicity and funniness ratings in this dataset.
The relation between structural markedness and funniness ratings (A), iconicity ratings (B), and funniness and iconicity together (C), in a set of 1419 English words. Each dot represents 14 or 15 words. Solid line with smoothed mean shows cumulative markedness. Other lines show relative prevalence of complex onsets (flap), codas (clunk), and verbal diminutives (drizzle). Higher structural markedness goes together with higher iconicity and funniness ratings. This supports the theory of structural markedness as a metacommunicative cue.
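The "14 or 15 words" per dot is what one would get from splitting the word set into near-equal bins. The sketch below assumes 100 bins (percentiles of the rating scale), which is a guess that happens to match the caption rather than a documented detail of the analysis; `bin_sizes` is a hypothetical helper.

```python
def bin_sizes(n_items, n_bins):
    """Sizes of n_bins near-equal bins that together hold n_items."""
    base, extra = divmod(n_items, n_bins)
    # The first `extra` bins take one additional item each.
    return [base + 1 if i < extra else base for i in range(n_bins)]


# Assumption: 100 bins over the 1419 rated words.
sizes = bin_sizes(1419, 100)
print(sorted(set(sizes)))   # [14, 15]
print(sizes.count(15))      # 19 bins of 15 words, the remaining 81 hold 14
```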
Vowel-colour associations for 1164 participants (central panel), showing, clockwise from bottom left: (a) a participant with very low structure yet high consistency across trials, probably a false positive synaesthete; (b) a typical nonsynaesthete with mappings that are both inconsistent and unstructured; (c) a middling participant with significant structure but inconsistent choices across trials; (d) a highly structured but inconsistent participant; and (e) a typical vowel-colour synaesthete, with highly structured, consistent and categorical mappings.