2021
DOI: 10.48550/arxiv.2105.02732
Preprint

What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

Alexandra Sasha Luccioni,
Joseph D. Viviano

Abstract: Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after f…

Cited by 14 publications (12 citation statements)
References 49 publications (57 reference statements)
“…For example, words referring to social groups and identities (e.g., "gay") may be coded as not only highly semantically related to the Social Groups dimension, but also as relatively highly related to the Morality dimension. Such results may reflect the fact that moralizing language is often used to discuss social groups in word embeddings' training data (e.g., the Common Crawl; Luccioni & Viviano, 2021) and that social group labels are often associated with cultural biases (e.g., the use of "gay" as a general negative term; Nicolas & Skinner, 2012). The dictionaries, developed through a literature search (Nicolas et al., 2021) and lexical expansion based on more vetted data (WordNet; Fellbaum, 1998), may be much less susceptible to these biases, although they by no means eliminate them.…”
Section: Discussion (mentioning)
confidence: 99%
“…The RealToxicityPrompts work [38] revealed that Common Crawl contained over 300,000 documents from unreliable news sites and banned subreddit pages containing hate speech and racism. More recently, Luccioni and Viviano's initial study [39] estimated the 'Hate speech' content level at around 4.02%-5.24% (the share of documents with 1+ hate n-grams was estimated higher, at 17.78%). With regards to CCAligned, a 119-language parallel dataset built off 68 snapshots of Common Crawl, Caswell et al. [40] revealed notable amounts of pornographic content (>10%) for 11 languages, with prevalence rates as high as 24% for language pairs such as en-om_KE.…”
Section: The Common-Crawl (mentioning)
confidence: 99%
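The prevalence figures quoted above come from n-gram matching against hate-speech wordlists rather than full document classification. As a purely illustrative sketch (not the authors' actual pipeline; the wordlist, tokenization, and corpus sample below are placeholder assumptions), one could estimate the share of documents containing at least one flagged n-gram along these lines:

```python
# Illustrative sketch only: estimates the fraction of documents that contain
# at least one n-gram from a hate-speech wordlist. The wordlist, tokenizer,
# and corpus sample below are placeholders, not the method used in the paper.
import re
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams of length n from a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def flag_document(text: str, wordlist: Set[Tuple[str, ...]], max_n: int = 3) -> bool:
    """Return True if the document contains at least one flagged n-gram."""
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenization (assumption)
    return any(
        gram in wordlist
        for n in range(1, max_n + 1)
        for gram in ngrams(tokens, n)
    )


def prevalence(documents: Iterable[str], wordlist: Set[Tuple[str, ...]]) -> float:
    """Fraction of documents with 1+ flagged n-grams (cf. the 17.78% figure above)."""
    docs = list(documents)
    flagged = sum(flag_document(doc, wordlist) for doc in docs)
    return flagged / len(docs) if docs else 0.0


if __name__ == "__main__":
    # Placeholder wordlist and corpus sample, for demonstration only.
    demo_wordlist = {("slur1",), ("slur2", "phrase")}
    demo_docs = ["an innocuous web page", "a page containing slur1 somewhere"]
    print(f"1+ hate n-gram prevalence: {prevalence(demo_docs, demo_wordlist):.2%}")
```

A wordlist-based estimate of this kind is cheap to run at corpus scale but is sensitive to the wordlist and tokenization chosen, which is one reason the reported hate-speech ranges (4.02%-5.24% vs. 17.78%) differ by detection criterion.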
“…Toxic speech is a widespread problem on online platforms (Duggan, 2017; Gorwa et al., 2020) and in training corpora (Gehman et al., 2020; Luccioni and Viviano, 2021; Radford et al., 2018b). Moreover, the problem of toxic speech from LMs on online platforms is not easy to address.…”
Section: Problem (mentioning)
confidence: 99%