A. Overview of included languages with dataset size in hours and the top 3 sequentially identified response tokens as transcribed in the corpus. B. Location of the largest speech community for each language. C. Assessing the impact of sparse data on UMAP projections using three samples of Dutch response tokens. Comparing the full dataset (a) with randomly sampled subsets of decreasing size (b, c) suggests that the projections are isomorphic across scales and that clustering solutions remain interpretable with as few as 150 tokens.
Response tokens like English mhmm, uhuhh, yeah or Catalan mm, sí, vale are tricky to study in the wild: their phonetic realizations can differ considerably from how they are transcribed. Here we use UMAP, a dimensionality reduction method used in bioacoustics and other fields, to explore the shape of response token inventories in 16 languages. Every point represents a single response token; the closer two points are, the more similar the tokens are acoustically. Spectrograms drawn around the rim of the plots provide a direct view of the acoustic structure of tokens and enable quick sanity checks.
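The sparse-data check in panel C relies on random subsets of decreasing size. A minimal sketch of how such subsets could be drawn is given below. It is an illustration only, not the procedure used here: the function name `nested_subsamples`, the placeholder token identifiers, the full-set size of 2000, the intermediate size of 600, and the choice to nest each smaller subset inside the larger one are all assumptions. Only the smallest size of 150 tokens comes from the caption.

```python
import random


def nested_subsamples(tokens, sizes, seed=0):
    """Return random subsets of `tokens`, one per requested size.

    Subsets are returned from largest to smallest. Each smaller subset
    is drawn from the next larger one, so the subsets are nested
    (an assumption of this sketch, not a documented property of the study).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    pool = list(tokens)
    subsets = []
    for size in sorted(sizes, reverse=True):
        if size > len(pool):
            raise ValueError(f"cannot draw {size} items from {len(pool)}")
        pool = rng.sample(pool, size)  # sampling without replacement
        subsets.append(pool)
    return subsets


# Hypothetical stand-in for a list of response token identifiers.
tokens = [f"token_{i:04d}" for i in range(2000)]

# 600 is an arbitrary intermediate size; 150 matches the smallest sample in panel C.
medium, small = nested_subsamples(tokens, sizes=[600, 150])
print(len(tokens), len(medium), len(small))
```

Each subset would then be projected separately with the same settings, so that the resulting plots can be compared for similar overall shape. Nesting the subsets keeps every point in the smallest sample traceable in the larger ones, which makes that visual comparison easier.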