Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.385

From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology

Abstract: Informal social interaction is the primordial home of human language. Linguistically diverse conversational corpora are an important and largely untapped resource for computational linguistics and language technology. Through the efforts of a worldwide language documentation movement, such corpora are increasingly becoming available. We show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure and social action, with implications for language …

Cited by 14 publications (10 citation statements)
References 69 publications (84 reference statements)
“…For the quantitative and inductive analysis that we envision, we need relatively large and maximally diverse language resources with time-aligned transcriptions. Rather than working with noninteractive data sources or collecting new data, here we explore the potential of language resources collected by the global language documentation movement [28,4]. We curate corpora of unscripted conversation made available in language documentation archives.…”
Section: Methods (mentioning)
confidence: 99%
“…In particular, we look for items that: (i) feature in the top decile of frequency counts by turn format per corpus; and (ii) occur at least once in a series of at least two produced by the same speaker. These search criteria reflect two basic observations about response tokens: their high frequency in naturally occurring talk [4], and the fact that they often occur in series of consecutive response tokens [33].…”
Section: Sequential Search Methods (mentioning)
confidence: 99%
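The two search criteria quoted above can be illustrated with a minimal Python sketch. It is not the citing authors' implementation. The function name `candidate_response_tokens`, the `(speaker, text)` input layout, the lowercasing used to define a "turn format", and the reading of "series" as consecutive turns by the same speaker (ignoring intervening turns by others) are all assumptions made for illustration.

```python
from collections import Counter


def candidate_response_tokens(turns):
    """Return turn formats that satisfy both search criteria.

    `turns` is a list of (speaker, text) pairs in temporal order for one
    corpus. This layout is hypothetical, not taken from the cited paper.
    """
    # Assumption: a "turn format" is the lowercased, stripped turn text.
    formats = [(speaker, text.strip().lower()) for speaker, text in turns]
    counts = Counter(fmt for _, fmt in formats)

    # (i) Top decile of frequency counts: rank the distinct formats by
    # count and keep those at or above the count of the last format
    # inside the top 10% (at least one format is always kept).
    ranked = sorted(counts.values(), reverse=True)
    n_top = max(1, len(ranked) // 10)
    cutoff = ranked[n_top - 1]
    frequent = {fmt for fmt, count in counts.items() if count >= cutoff}

    # (ii) Occurs at least once in a series of at least two by the same
    # speaker: the speaker's previous turn had the same format.
    in_series = set()
    last_format = {}
    for speaker, fmt in formats:
        if last_format.get(speaker) == fmt:
            in_series.add(fmt)
        last_format[speaker] = fmt

    return frequent & in_series


turns = [
    ("A", "mhm"), ("B", "so I went out"), ("A", "mhm"),
    ("B", "and it rained"), ("A", "oh"), ("B", "yeah"), ("A", "Mhm"),
]
print(candidate_response_tokens(turns))
```

In this toy input, "mhm" is both the most frequent format and repeated across consecutive turns by speaker A, so it is the only candidate returned. On a real corpus the decile cut-off and the definition of a series would need to follow the cited paper's own specification.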
“…Most turns in spoken conversation are short and most transitions between turns consist of very short gaps between speakers [4], which are preferred to other possible kinds of transition such as longer gaps or overlaps (two speakers talking at once). A modal value of around 200 milliseconds of silence between speakers has been shown for a wide range of languages and speakers, with only minor language-specific variations [5][6][7][8].…”
Section: Introduction (mentioning)
confidence: 99%
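As a rough illustration of how such transition timing can be measured from time-aligned transcriptions, the sketch below computes the offset between the end of one speaker's turn and the start of the next speaker's turn (positive for a gap, negative for an overlap) and then finds the most populated bin. The function names, the `(speaker, start_ms, end_ms)` layout, and the 100 ms bin width are assumptions for illustration, not details taken from the cited work.

```python
from collections import Counter


def transition_offsets(turns):
    """Offsets in ms at each speaker change: positive = gap, negative = overlap.

    `turns` is a list of (speaker, start_ms, end_ms) tuples sorted by start
    time. This layout is hypothetical.
    """
    offsets = []
    for (prev_spk, _, prev_end), (next_spk, next_start, _) in zip(turns, turns[1:]):
        if prev_spk != next_spk:  # only transitions between different speakers
            offsets.append(next_start - prev_end)
    return offsets


def modal_bin(offsets, width=100):
    """Lower edge (ms) of the most populated bin of the given width."""
    bins = Counter((offset // width) * width for offset in offsets)
    return bins.most_common(1)[0][0]


turns = [
    ("A", 0, 1000), ("B", 1200, 2000), ("A", 2250, 3000),
    ("B", 2900, 3500), ("B", 3600, 4000),
]
offsets = transition_offsets(turns)
print(offsets, modal_bin(offsets))
```

Binning is used here because raw millisecond offsets rarely repeat exactly. A kernel density estimate would be an alternative way to locate the mode.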
“…Previous work has also shown that the rate of BCs produced and, more importantly, their specific lexical and intonational realisation, can have a profound influence on (perceived) communicative success and mutual understanding, as well as on subjective judgements by conversational partners. This has been explored both in the interactions of humans with virtual agents in spoken dialogue systems (Fujie et al., 2004; Ward & DeVault, 2016; Ward & Tsukahara, 1999) and in natural conversations, usually in cross-cultural or comparative settings (e.g., Cutrone, 2005, 2014; Dingemanse & Liesenfeld, 2022; Li, 2006; Tottie, 1991; Xudong, 2008; Young & Lee, 2004).…”
mentioning
confidence: 99%