A. Overview of included languages with dataset size in hours and top 3 sequentially identified response tokens as transcribed in the corpus. B. Location of largest speech community. C. Assessing the impact of sparse data on UMAP projections using three samples of Dutch response tokens. Comparing the full dataset (a) with randomly sampled subsets of decreasing size (b, c) suggests that the projections are isomorphic across scales and that clustering solutions remain interpretable with as few as 150 tokens.
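The subsampling step behind panel C can be sketched as follows. This is a minimal illustration, not the pipeline used for the figure: the token identifiers, the total of 1000 tokens, the intermediate size of 500, and the seed are placeholder assumptions, and drawing each subset independently from the full set is also an assumption. Only the smallest size of 150 comes from the caption.

```python
import random

def random_subsets(tokens, sizes, seed=0):
    """Draw one random subset per requested size, each without replacement
    from the full token list (subsets are independent, not nested)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [rng.sample(tokens, size) for size in sorted(sizes, reverse=True)]

# Hypothetical token identifiers standing in for Dutch response tokens.
tokens = [f"token_{i:04d}" for i in range(1000)]

# Subsets of decreasing size, corresponding to panels (b) and (c).
mid, small = random_subsets(tokens, sizes=[500, 150])
```

Each subset would then be projected separately and compared by eye with the projection of the full dataset.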
Response tokens like English mhmm, uhuhh, yeah or Catalan mm, sí, vale are tricky to study in the wild: their phonetic realizations can be quite different from how they are transcribed. Here we use UMAP, a dimensionality reduction method used in bioacoustics and other fields, to explore the shape of inventories of response tokens in 16 languages. Every point represents a single response token; the closer two points are, the more similar they are acoustically. Spectrograms drawn around the rim of the plots provide a direct view of the acoustic structure of tokens and enable quick sanity checks.
MDS plot of similarity ratings for ideophones derived from a pile-sorting field task. Interpretable clusters are circled and labelled in the plot. One group, with saaa ‘cool sensation’, nyagbalaa ‘pungent’, buàà ‘tasteless’, nyɛ̃kɛ̃nyɛ̃kɛ̃ ‘intensely sweet’ and mɛ̃rɛ̃mɛ̃rɛ̃ ‘sweet’, can be characterised as TASTE. Another cluster includes dɔbɔrɔɔ ‘soft’, safaraa ‘coarse-grained’, wòsòròò ‘rough’, fũɛ̃ fũɛ̃ ‘malleable’, wùrùfùù ‘fluffy’, pɔlɔpɔlɔ ‘smooth’, fiɛfiɛ ‘silky’, kpɔlɔkpɔlɔ ‘slippery’ and pɔtɔpɔtɔ ‘soggy’. These ideophones seem to form a domain of HAPTIC TOUCH. Another group comprises gelegele ‘shiny’, fututu ‘pure white’, kpinakpina ‘black’ and wɔ̃̀rã̀wɔ̃̀rã̀ ‘spotted’. This domain we may summarise as SURFACE APPEARANCE. A further cluster is formed by minimini ‘spherical’ and gìlìgìlì ‘circular’ (these two tightly together) and sɔ̀dzɔ̀lɔ̀ɔ̀ ‘oblong’, miɔmiɔ ‘pointed’ and tagbaraa ‘long’, suggesting a broader domain of SHAPE.
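To make the step from pile sorts to similarity ratings concrete, here is a minimal sketch. It assumes that similarity is aggregated as the proportion of participants who placed two items in the same pile, which is a common convention but is not confirmed by the caption. The function name and the three participants' sorts are invented for illustration and are not field data.

```python
from itertools import combinations

def pile_sort_similarity(sorts):
    """Return, for every pair of items, the proportion of participants
    who placed both items in the same pile."""
    items = sorted({item for piles in sorts for pile in piles for item in pile})
    counts = {pair: 0 for pair in combinations(items, 2)}
    for piles in sorts:
        for pile in piles:
            # Sorting keeps pair order consistent with the keys in `counts`.
            for pair in combinations(sorted(pile), 2):
                counts[pair] += 1
    return {pair: n / len(sorts) for pair, n in counts.items()}

# Invented sorts from three hypothetical participants; each inner set is one pile.
sorts = [
    [{"minimini", "gìlìgìlì"}, {"saaa", "buàà"}],
    [{"minimini", "gìlìgìlì", "saaa"}, {"buàà"}],
    [{"minimini", "gìlìgìlì"}, {"saaa"}, {"buàà"}],
]

similarity = pile_sort_similarity(sorts)
# MDS operates on distances, so convert similarity to dissimilarity.
dissimilarity = {pair: 1 - s for pair, s in similarity.items()}
```

Pairwise dissimilarities of this kind are the usual input to an MDS routine, which places items that were often sorted together close to each other in the plot.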