Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-short.24
What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

Abstract: Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after f…

Cited by 38 publications (28 citation statements)
References 49 publications
“…Therefore, we conduct a comprehensive safety evaluation of the aforementioned dialogue models. Keyword filtering (Xu et al., 2020; Roller et al., 2021; Luccioni and Viviano, 2021) and adopting classifiers trained on safety-related datasets are both effective ways to evaluate safety. However, they may lose accuracy and completeness.…”
Section: Dialogue Safety Evaluation (mentioning)
confidence: 99%
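The keyword-filtering approach named in the statement above can be sketched minimally. The flagged-term list and the `flag_utterance` helper below are hypothetical illustrations, not taken from any cited work; real safety evaluations typically use much larger curated lexicons.

```python
import re

# Hypothetical placeholder list; a real lexicon would be far larger and curated.
FLAGGED_TERMS = {"badword", "awfulword"}

def flag_utterance(text: str, flagged=FLAGGED_TERMS) -> bool:
    """Return True if any flagged term appears as a whole word (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(tok in flagged for tok in tokens)

print(flag_utterance("This contains badword somewhere"))  # True
print(flag_utterance("A perfectly benign sentence"))      # False
```

As the statement notes, this kind of surface matching trades completeness for simplicity: it misses paraphrased or obfuscated harmful content, which is why classifier-based evaluation is used alongside it.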
“…To deal with the demands of deep learning, data curators and researchers have turned to enormous internet-scraped datasets such as the Common Crawl corpus or WebText. As these unstructured corpora become larger, the risk of them containing harmful content increases, and the larger the dataset, the more difficult it is for humans to explore what is in it and audit it for quality or toxicity (Hanna and Park, 2020; Luccioni and Viviano, 2021; Kreutzer et al., 2022).…”
Section: Harms and Risks in NLP Data (mentioning)
confidence: 99%
“…However, work in seemingly unrelated NLP domains (e.g. NLG, part-of-speech tagging, or semantic search) may still encounter spurious harms in datasets, especially if these are large-scale and scraped from internet sources (Luccioni and Viviano, 2021; Dodge et al., 2021; Kreutzer et al., 2022).…”
Section: Introduction (mentioning)
confidence: 99%
“…For instance, Wikipedia is highly biased in terms of the topics covered and in terms of the demographics of its contributors, particularly for gender, race, and geography (Barera, 2020), resulting in similar concerns of representation in technologies developed on Wikipedia data. Common Crawl, meanwhile, has been shown to contain hate speech and over-represent sexually explicit content (Luccioni and Viviano, 2021), and typical web-crawling collection practices have no structures for supporting informed consent beyond websites' own terms and conditions policies that users rarely read (Cakebread, 2017; Obar and Oeldorf-Hirsch, 2020). Several documentation schemas for natural language processing (NLP) datasets (Bender and Friedman, 2018; Gebru et al., 2018; Gebru et al., 2021; Holland et al., 2018; Pushkarna et al., 2021) have been recently produced to aid NLP researchers in documenting their own datasets (Gao et al., 2020; Biderman et al., 2022; Gehrmann et al., 2021; Wang et al., 2021) and even to retrospectively document and analyze datasets that were developed and released by others without thorough documentation (Bandy and Vincent, 2021; Kreutzer et al., 2021; Birhane et al., 2021; Dodge et al., 2021).…”
1 http://commoncrawl.org/
Section: Introduction (mentioning)
confidence: 99%