2021
DOI: 10.48550/arxiv.2104.08758
Preprint

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Cited by 5 publications (9 citation statements). References 0 publications.
“…However, such web-scale corpora are known to be noisy and contain undesirable content [53,48,21], with their multilingual partitions often having their own specific issues such as unusable text, misaligned and mislabeled/ambiguously labeled data [40]. To mitigate this, we manually audit our data.…”
Section: Introduction
confidence: 99%
“…To test whether Sim(people, men) > Sim(people, women) at the level of collective concepts, we used word embeddings (13) extracted from the May 2017 Common Crawl corpus [CC-MAIN-2017-22; (41)], which contains a large cross section of the internet: over 630 billion words from 2.96 billion web pages and 250 TiB of uncompressed content. Although the Common Crawl is not accompanied by documentation about its contents, it likely includes informal text (e.g., blogs and discussion forums) written by many individuals, as well as more formal text written by the media, corporations, and governments, mostly in English (42,43). Using word embeddings extracted from this massive corpus, we computed the similarity in linguistic context between words, a proxy for the similarity between the concepts denoted, as the cosine of the angle between corresponding embeddings in vector space, or cosine similarity.…”
Section: Results
confidence: 99%
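A minimal sketch of the cosine-similarity comparison the excerpt describes, Sim(people, men) versus Sim(people, women); the low-dimensional vectors and their values below are hypothetical placeholders for illustration, not embeddings actually extracted from the Common Crawl:

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (|u| * |v|), ranging over [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings for illustration only;
# embeddings trained on a corpus like the Common Crawl typically
# have hundreds of dimensions.
emb = {
    "people": np.array([0.8, 0.1, 0.3, 0.4]),
    "men":    np.array([0.7, 0.2, 0.4, 0.3]),
    "women":  np.array([0.6, 0.4, 0.2, 0.5]),
}

# The test described in the excerpt: is Sim(people, men) > Sim(people, women)?
print(cosine_similarity(emb["people"], emb["men"]))
print(cosine_similarity(emb["people"], emb["women"]))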
“…The May 2017 Common Crawl is a large collection of over 630 billion tokens (roughly, words) and contains 2.96+ billion web pages and over 250 TiB of uncompressed content (41). Recent investigations of the Common Crawl suggest that most of this corpus is written in English and based on webpages generated within a year or two of their inclusion in the corpus (43). The 25 most prevalent websites in the 2019 version include websites on patent filings, news coverage, and peer-reviewed scientific publications (43), but more informal content such as travel blogs and personal websites is also represented (42).…”
Section: Word Embeddings (Step 2)
confidence: 99%
“…One format that has been proposed for such dataset documentation (Bender and Friedman, 2018) is 'Datasheets'. Some work in this direction includes documentation on the Colossal Clean Crawled Corpus (C4) that highlights the most prominently represented sources and references to help illuminate whose biases are likely to be encoded in the dataset (Dodge et al, 2021). Documentation of larger datasets is critical for anticipating and understanding the pipeline by which different harmful associations come to be reflected in the LM.…”
Section: Documentation Of Biases In Training Corpora
confidence: 99%
“… (2020); Caliskan et al. (2017); Dodge et al. (2021); Ferrer et al. (2020); Zhao et al. (2017); Abid et al. (2021); Huang et al. (2020); Lucy and Bamman (2021); Nadeem et al. (2020); Nangia et al. (2020); Nozza et al. (2021)
2.1.3 Exclusionary norms: Cao and Daumé III (2020)
2.1.4 Toxic language: Duggan (2017); Gehman et al. (2020); Gorwa et al. (2020); Luccioni and Viviano (2021); Rae et al. (2021); Wallace et al. (2020)
2.1.5 Lower performance by social group: Blodgett and O'Connor (2017); Blodgett et al. (2016); Joshi et al. (2021); Koenecke et al. (2020); Ruder (2020); Winata et al. (2021)
… (2018); Golbeck (2018); Makazhanov et al. (2014); Morgan-Lopez et al. (2017); Nguyen et al. (2013); Park et al. (2015); Preoţiuc-Pietro et al. (2017)…”
confidence: 99%