2021
DOI: 10.1162/tacl_a_00383

Characterizing English Variation across Social Media Communities with BERT

Abstract: Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the n…

Citations: cited by 18 publications (33 citation statements)
References: 49 publications
“…Defining domains in this way is intuitive and conveys a great deal about the type of language that can be expected in each document. Other accounts of domains (e.g., Lucy and Bamman, 2021; Gururangan et al., 2020) may be studied in future work. While other multidomain corpora (Koh et al., 2021; Gao et al., 2020) cover many more domains, our corpus is restricted to datasets with more permissive licensing to support reproducibility.…”
Section: Multi-domain Corpus
Citation type: mentioning
confidence: 99%
“…Machine-learning methods offer promising alternatives for selecting tweets. Neural language models such as BERT [49,50], for example, may provide solutions to the ambiguity problem by using linguistic context information [50]. However, these models, while very powerful, have so far only been applied to a limited number of linguistic categories.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…For example, an ideology that distinguishes between standard and non-standard language variations surfaces in text normalization tasks (van der Goot et al., 2021), which tend to strip documents of pragmatic nuance (Baldwin and Chai, 2011) and social signals (Nguyen et al., 2021). Language on the Internet has been historically treated as a noisy variant of English, even though lexical variation on the Internet is highly communicative of social signals (Eisenstein, 2013), and varies considerably along demographic variables (Eisenstein et al., 2014) and community membership (Lucy and Bamman, 2021). Language ideologies also surface in tools for toxicity detection; for example, the classification behavior of the PERSPECTIVE API (a popular hate speech detector) aligns with the attitudes of conservative, white, female annotators, who tend to perceive African-American dialects as more toxic (Sap et al., 2021).…”
Section: Related Work
Citation type: mentioning
confidence: 99%