On learning and representing social meaning in NLP: a sociolinguistic perspective

Nguyen, Dong; Rosseel, Laura; Grieve, Jack

doi:10.18653/v1/2021.naacl-main.50

Cited by 15 publications (17 citation statements)

References 74 publications

“…Language ideologies have an important, but often unacknowledged, influence on the development of NLP technologies (Blodgett et al, 2020). For example, an ideology that distinguishes between standard and non-standard language variations surfaces in text normalization tasks (van der Goot et al, 2021), which tend to strip documents of pragmatic nuance (Baldwin and Chai, 2011) and social signals (Nguyen et al, 2021). Language on the Internet has been historically treated as a noisy variant of English, even though lexical variation on the Internet is highly communicative of social signals (Eisenstein, 2013), and varies considerably along demographic variables (Eisenstein et al, 2014) and community membership (Lucy and Bamman, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Gururangan, Card, Dreier et al. 2022

Preprint

Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles, written by students from across the country, we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.

1 We note that the term quality is often ill-defined in the NLP literature. For example, Brown et al. (2020) and refer to "high-quality text" or "high-quality sources" (both citing Wikipedia as an example) but without explaining precisely what is meant.

“…Some of the issues detailed in this paper may be attributed to a lack of language understanding, especially the social meaning of language (Hovy & Spruit, 2016; Flek, 2020; Hovy & Yang, 2021; Nguyen et al, 2021). See for example the discussion of the YEA-SAYER (ELIZA) EFFECT in §1.…”

Section: Natural Language Understanding (mentioning)

confidence: 99%

Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling

Dinan, Abercrombie, Bergman et al. 2021

Preprint

Warning: this paper contains example data that may be offensive or upsetting. Over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. Researchers must thus wrestle with the issue of how and when to release these models. In this paper, we survey the problem landscape for safety for end-to-end conversational AI and discuss recent and related work. We highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. We additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational AI models.

“…Although lexical normalization potentially removes social signals (Nguyen et al, 2021), it has also been shown to boost many downstream NLP tasks, including named entity recognition (Schulz et al, 2016; Plank et al, 2020), POS tagging (Derczynski et al, 2013; Schulz et al, 2016; Zupan et al, 2019), dependency and constituency parsing (Baldwin and Li, 2015; van der Goot et al, 2020; van der Goot and van Noord, 2017), sentiment analysis (Van Hee et al, 2017; Sidarenka, 2019, pp. 79, 122), and machine translation (Bhat et al, 2018).…”

Section: Definition - Lexical Normalization (mentioning)

confidence: 99%

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Goot, Ramponi, Zubiaga et al. 2021

Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021)

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MULTILEXNORM shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 12 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.
