Frequency/rank distributions of tokenized items (‘words’) and recurring turn formats in conversational corpora with at least 20 such turn formats, representing 22 languages (8 phyla). Tokenized items (blue) show a linear frequency/rank relation in log/log space. Recurring turn formats (whether one-word ○ or multi-word +) appear to obey a similar frequency/rank distribution for the 20% of turns that occur >20 times (purple), tapering off towards lower frequencies and unique turns (grey). The fit fluctuates with corpus size, and the parallelism of the distributions is most apparent in larger corpora.
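As a rough illustration of what a linear frequency/rank relation in log/log space means, the sketch below ranks items by frequency and fits a least-squares slope to log(frequency) against log(rank). It is not the authors' analysis code: the function names `rank_frequency` and `loglog_slope`, the `min_count` parameter, and the toy turn list are all assumptions made up for this example.

```python
import math
from collections import Counter


def rank_frequency(items, min_count=1):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent item.

    min_count is a hypothetical filter: items seen fewer times are dropped.
    """
    counts = Counter(items)
    freqs = sorted((f for f in counts.values() if f >= min_count), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (invented): frequencies 12, 6, 4, 3 follow frequency = 12 / rank.
turns = ["yeah"] * 12 + ["mhm"] * 6 + ["oh okay"] * 4 + ["right"] * 3

pairs = rank_frequency(turns)
slope = loglog_slope(pairs)
print(pairs)
print(slope)
```

For the toy data the slope comes out very close to -1, the idealized Zipfian case. On a real corpus, passing `min_count=21` would restrict the fit to items occurring more than 20 times, mirroring the threshold mentioned in the caption.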
Dingemanse, M., & Liesenfeld, A. (2022). From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5614–5633. doi:10.18653/v1/2022.acl-long.385