2021
DOI: 10.48550/arxiv.2105.02732
Preprint

What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

Alexandra Sasha Luccioni,
Joseph D. Viviano

Abstract: Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after f…

Cited by 14 publications (12 citation statements)
References 49 publications (57 reference statements)
“…For example, words referring to social groups and identities (e.g., "gay") may be coded as not only highly semantically related to the Social Groups dimension, but also as relatively highly related to the Morality dimension. Such results may reflect the fact that moralizing language is often used to discuss social groups in word embeddings' training data (e.g., the Common Crawl; Luccioni & Viviano, 2021) and that social group labels are often associated with cultural biases (e.g., the use of "gay" as a general negative term; Nicolas & Skinner, 2012). The dictionaries, developed through a literature search (Nicolas et al., 2021) and lexical expansion based on more vetted data (WordNet; Fellbaum, 1998), may be much less susceptible to these biases, although they by no means eliminate them.…”
Section: Discussion (mentioning)
confidence: 99%
“…The RealToxicityPrompts work [38] revealed that Common Crawl contained over 300,000 documents from unreliable news sites and banned subreddit pages containing hate speech and racism. More recently, Luccioni and Viviano's initial study [39] estimated the 'Hate speech' content level at around 4.02%-5.24% (the share of documents with 1+ hate n-grams was estimated higher, at 17.78%). With regards to CCAligned, a 119-language parallel dataset built off 68 snapshots of Common Crawl, Caswell et al. [40] revealed notable amounts of pornographic content (>10%) for 11 languages, with prevalence rates as high as 24% for language pairs such as en-om_KE.…”
Section: The Common-Crawl (mentioning)
confidence: 99%
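The prevalence figures quoted above come from n-gram matching against hate-speech wordlists rather than full document classification. As a purely illustrative sketch (not the authors' actual pipeline; the wordlist, tokenization, and corpus sample below are placeholder assumptions), one could estimate the share of documents containing at least one flagged n-gram along these lines:

```python
# Illustrative sketch only: estimates the fraction of documents that contain
# at least one n-gram from a hate-speech wordlist. The wordlist, tokenizer,
# and corpus sample below are placeholders, not the method used in the paper.
import re
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams of length n from a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def flag_document(text: str, wordlist: Set[Tuple[str, ...]], max_n: int = 3) -> bool:
    """Return True if the document contains at least one flagged n-gram."""
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenization (assumption)
    return any(
        gram in wordlist
        for n in range(1, max_n + 1)
        for gram in ngrams(tokens, n)
    )


def prevalence(documents: Iterable[str], wordlist: Set[Tuple[str, ...]]) -> float:
    """Fraction of documents with 1+ flagged n-grams (cf. the 17.78% figure above)."""
    docs = list(documents)
    flagged = sum(flag_document(doc, wordlist) for doc in docs)
    return flagged / len(docs) if docs else 0.0


if __name__ == "__main__":
    # Placeholder wordlist and corpus sample, for demonstration only.
    demo_wordlist = {("slur1",), ("slur2", "phrase")}
    demo_docs = ["an innocuous web page", "a page containing slur1 somewhere"]
    print(f"1+ hate n-gram prevalence: {prevalence(demo_docs, demo_wordlist):.2%}")
```

A wordlist-based estimate of this kind is cheap to run at corpus scale but is sensitive to the wordlist and tokenization chosen, which is one reason the reported hate-speech ranges (4.02%-5.24% vs. 17.78%) differ by detection criterion.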
“…Toxic speech is a widespread problem on online platforms (Duggan, 2017; Gorwa et al., 2020) and in training corpora (Gehman et al., 2020; Luccioni and Viviano, 2021; Radford et al., 2018b). Moreover, the problem of toxic speech from LMs on online platforms is not easy to address.…”
Section: Problem (mentioning)
confidence: 99%