Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-short.24
What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

Abstract: Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after f…

Cited by 38 publications (28 citation statements)
References 49 publications
“…Therefore, we conduct a comprehensive safety evaluation of the aforementioned dialogue models. Keyword filtering (Xu et al., 2020; Roller et al., 2021; Luccioni and Viviano, 2021) and adopting classifiers trained on safety-related datasets are both effective ways to evaluate safety. However, they may lose accuracy and completeness.…”
Section: Dialogue Safety Evaluation (mentioning)
confidence: 99%
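The keyword-filtering approach named in the statement above can be sketched minimally. The flagged-term list and the `flag_utterance` helper below are hypothetical illustrations, not taken from any cited work; real safety evaluations typically use much larger curated lexicons.

```python
import re

# Hypothetical placeholder list; a real lexicon would be far larger and curated.
FLAGGED_TERMS = {"badword", "awfulword"}

def flag_utterance(text: str, flagged=FLAGGED_TERMS) -> bool:
    """Return True if any flagged term appears as a whole word (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(tok in flagged for tok in tokens)

print(flag_utterance("This contains badword somewhere"))  # True
print(flag_utterance("A perfectly benign sentence"))      # False
```

As the statement notes, this kind of surface matching trades completeness for simplicity: it misses paraphrased or obfuscated harmful content, which is why classifier-based evaluation is used alongside it.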
“…To deal with the demands of deep learning, data curators and researchers have turned to enormous internet-scraped datasets such as the Common Crawl corpus or WebText. As these unstructured corpora become larger, the risk of them containing harmful content increases, and the larger the dataset, the more difficult it is for humans to explore what is in it and audit it for quality or toxicity (Hanna and Park, 2020; Luccioni and Viviano, 2021; Kreutzer et al., 2022).…”
Section: Harms and Risks in NLP Data (mentioning)
confidence: 99%
“…However, work in seemingly unrelated NLP domains (e.g. NLG, part-of-speech tagging, or semantic search) may still encounter spurious harms in datasets, especially if these are large-scale and scraped from internet sources (Luccioni and Viviano, 2021; Dodge et al., 2021; Kreutzer et al., 2022).…”
Section: Introduction (mentioning)
confidence: 99%
“…For instance, Wikipedia is highly biased in terms of the topics covered and in terms of the demographics of its contributors, particularly for gender, race, and geography (Barera, 2020), resulting in similar concerns of representation in technologies developed on Wikipedia data. Common Crawl, meanwhile, has been shown to contain hate speech and over-represent sexually explicit content (Luccioni and Viviano, 2021), and typical web-crawling collection practices have no structures for supporting informed consent beyond websites' own terms and conditions policies that users rarely read (Cakebread, 2017; Obar and Oeldorf-Hirsch, 2020). Several documentation schemas for natural language processing (NLP) datasets (Bender and Friedman, 2018; Gebru et al., 2018; Gebru et al., 2021; Holland et al., 2018; Pushkarna et al., 2021) have been recently produced to aid NLP researchers in documenting their own datasets (Gao et al., 2020; Biderman et al., 2022; Gehrmann et al., 2021; Wang et al., 2021) and even to retrospectively document and analyze datasets that were developed and released by others without thorough documentation (Bandy and Vincent, 2021; Kreutzer et al., 2021; Birhane et al., 2021; Dodge et al., 2021).…”
1 http://commoncrawl.org/
Section: Introduction (mentioning)
confidence: 99%