2021
DOI: 10.48550/arxiv.2104.08758
Preprint

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Cited by 5 publications (9 citation statements). References 0 publications.
“…However, such web-scale corpora are known to be noisy and contain undesirable content [53,48,21], with their multilingual partitions often having their own specific issues such as unusable text, misaligned and mislabeled/ambiguously labeled data [40]. To mitigate this, we manually audit our data.…”
Section: Introduction
confidence: 99%
“…To test whether Sim(people, men) > Sim(people, women) at the level of collective concepts, we used word embeddings (13) extracted from the May 2017 Common Crawl corpus [CC-MAIN-2017-22; (41)], which contains a large cross section of the internet: over 630 billion words from 2.96 billion web pages and 250 TiB of uncompressed content. Although the Common Crawl is not accompanied by documentation about its contents, it likely includes informal text (e.g., blogs and discussion forums) written by many individuals, as well as more formal text written by the media, corporations, and governments, mostly in English (42,43). Using word embeddings extracted from this massive corpus, we computed the similarity in linguistic context between words, a proxy for the similarity between the concepts denoted, as the cosine of the angle between corresponding embeddings in vector space, or cosine similarity.…”
Section: Results
confidence: 99%
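A minimal sketch of the cosine-similarity comparison the excerpt describes, Sim(people, men) versus Sim(people, women); the low-dimensional vectors and their values below are hypothetical placeholders for illustration, not embeddings actually extracted from the Common Crawl:

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (|u| * |v|), ranging over [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings for illustration only;
# embeddings trained on a corpus like the Common Crawl typically
# have hundreds of dimensions.
emb = {
    "people": np.array([0.8, 0.1, 0.3, 0.4]),
    "men":    np.array([0.7, 0.2, 0.4, 0.3]),
    "women":  np.array([0.6, 0.4, 0.2, 0.5]),
}

# The test described in the excerpt: is Sim(people, men) > Sim(people, women)?
print(cosine_similarity(emb["people"], emb["men"]))
print(cosine_similarity(emb["people"], emb["women"]))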
“…The May 2017 Common Crawl is a large collection of over 630 billion tokens (roughly, words) and contains 2.96+ billion web pages and over 250 TiB of uncompressed content (41). Recent investigations of the Common Crawl suggest that most of this corpus is written in English and based on webpages generated within a year or two of their inclusion in the corpus (43). The 25 most prevalent websites in the 2019 version include websites on patent filings, news coverage, and peer-reviewed scientific publications (43), but more informal content such as travel blogs and personal websites is also represented (42).…”
Section: Word Embeddings (Step 2)
confidence: 99%
“…One format that has been proposed for such dataset documentation (Bender and Friedman, 2018) is 'Datasheets'. Some work in this direction includes documentation on the Colossal Clean Crawled Corpus (C4) that highlights the most prominently represented sources and references to help illuminate whose biases are likely to be encoded in the dataset (Dodge et al, 2021). Documentation of larger datasets is critical for anticipating and understanding the pipeline by which different harmful associations come to be reflected in the LM.…”
Section: Documentation Of Biases In Training Corpora
confidence: 99%
“… (2020); Caliskan et al. (2017); Dodge et al. (2021); Ferrer et al. (2020); Zhao et al. (2017); Abid et al. (2021); Huang et al. (2020); Lucy and Bamman (2021); Nadeem et al. (2020); Nangia et al. (2020); Nozza et al. (2021)
2.1.3 Exclusionary norms: Cao and Daumé III (2020)
2.1.4 Toxic language: Duggan (2017); Gehman et al. (2020); Gorwa et al. (2020); Luccioni and Viviano (2021); Rae et al. (2021); Wallace et al. (2020)
2.1.5 Lower performance by social group: Blodgett and O'Connor (2017); Blodgett et al. (2016); Joshi et al. (2021); Koenecke et al. (2020); Ruder (2020); Winata et al. (2021)
… (2018); Golbeck (2018); Makazhanov et al. (2014); Morgan-Lopez et al. (2017); Nguyen et al. (2013); Park et al. (2015); Preoţiuc-Pietro et al. (2017)…”
confidence: 99%