Sampling response tokens

A. Overview of included languages with dataset size in hours and top 3 sequentially identified response tokens as transcribed in the corpus. B. Location of largest speech community. C. Assessing the impact of sparse data on UMAP projections using three samples of Dutch response tokens. A look at the full dataset (a) and random-sampled subsets of decreasing size (b, c) suggests isomorphism across scales and interpretability of clustering solutions as small as 150 tokens.

Liesenfeld, A., & Dingemanse, M. (2022). Bottom-up discovery of structure and variation in response tokens (‘backchannels’) across diverse languages. Proceedings of Interspeech 2022, 1126–1130. doi: 10.21437/Interspeech.2022-11288 PDF