Characterizing English Variation across Social Media Communities with BERT

Lucy, Li; Bamman, David

doi:10.1162/tacl_a_00383

Cited by 18 publications

(33 citation statements)

References 49 publications


“…Defining domains in this way is intuitive and conveys a great deal about the type of language that can be expected in each document. Other accounts of domains (e.g., Lucy and Bamman, 2021; Gururangan et al, 2020) may be studied in future work. While other multidomain corpora (Koh et al, 2021; Gao et al, 2020) cover many more domains, our corpus is restricted to datasets with more permissive licensing to support reproducibility.…”

Section: Multi-domain Corpus (mentioning)

confidence: 99%

DEMix Layers: Disentangling Domains for Modular Language Modeling

Gururangan, Lewis, Holtzman, et al. 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


We introduce a new domain expert mixture (DEMIX) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMIX layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMIX layers reduce test-time perplexity (especially for out-of-domain data), increase training efficiency, and enable rapid adaptation. Mixing experts during inference, using a parameter-free weighted ensemble, enables better generalization to heterogeneous or unseen domains. We also show it is possible to add experts to adapt to new domains without forgetting older ones, and remove experts to restrict access to unwanted domains. Overall, these results demonstrate benefits of domain modularity in language models.


“…Machine-learning methods offer promising alternatives for selecting tweets. Neural language models such as BERT [49,50], for example, may provide solutions to the ambiguity problem by using linguistic context information [50]. However, these models, while very powerful, have so far only been applied to a limited number of linguistic categories.…”

Section: Discussion (mentioning)

confidence: 99%

Geolocation of multiple sociolinguistic markers in Buenos Aires

Kellert, Matlis 2022

PLoS ONE


Analysis of language geography is increasingly being used for studying spatial patterns of social dynamics. This trend is fueled by social media platforms such as Twitter which provide access to large amounts of natural language data combined with geolocation and user metadata enabling reconstruction of detailed spatial patterns of language use. Most studies are performed on large spatial scales associated with countries and regions, where language dynamics are often dominated by the effects of geographic and administrative borders. Extending to smaller, urban scales, however, allows visualization of spatial patterns of language use determined by social dynamics within the city, providing valuable information for a range of social topics from demographic studies to urban planning. So far, few studies have been made in this domain, due, in part, to the challenges in developing algorithms that accurately classify linguistic features. Here we extend urban-scale geographical analysis of language use beyond lexical meaning to include other sociolinguistic markers that identify language style, dialect and social groups. Some features, which have not been explored with social-media data on the urban scale, can be used to target a range of social phenomena. Our study focuses on Twitter use in Buenos Aires and our approach classifies tweets based on contrasting sets of tokens manually selected to target precise linguistic features. We perform statistical analyses of eleven categories of language use to quantify the presence of spatial patterns and the extent to which they are socially driven. We then perform the first comparative analysis assessing how the patterns and strength of social drivers vary with category. Finally, we derive plausible explanations for the patterns by comparing them with independently generated maps of geosocial context. Identifying these connections is a key aspect of the social-dynamics analysis which has so far received insufficient attention.


“…For example, an ideology that distinguishes between standard and non-standard language variations surfaces in text normalization tasks (van der Goot et al, 2021), which tend to strip documents of pragmatic nuance (Baldwin and Chai, 2011) and social signals (Nguyen et al, 2021). Language on the Internet has been historically treated as a noisy variant of English, even though lexical variation on the Internet is highly communicative of social signals (Eisenstein, 2013), and varies considerably along demographic variables (Eisenstein et al, 2014) and community membership (Lucy and Bamman, 2021). Language ideologies also surface in tools for toxicity detection; for example, the classification behavior of the PERSPECTIVE API (a popular hate speech detector) aligns with the attitudes of conservative, white, female annotators, who tend to perceive African-American dialects as more toxic (Sap et al, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Gururangan, Card, Dreier, et al. 2022

Preprint


Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles, written by students from across the country, we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.

Footnote 1: We note that the term quality is often ill-defined in the NLP literature. For example, Brown et al. (2020) and refer to "high-quality text" or "high-quality sources" (both citing Wikipedia as an example) but without explaining precisely what is meant.

