Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Dodge, Jesse; Sap, Maarten; Marasović, Ana; Agnew, William; Ilharco, Gabriel; Groeneveld, Dirk; Gardner, Matt

doi:10.18653/v1/2021.emnlp-main.98

Cited by 79 publications

(54 citation statements)

References 39 publications

“…Yet, we expect that this tool will introduce friction [16] to successfully posting inappropriate posts and act as a powerful deterrent against bad actors, thereby delimiting their submission of such posts. Prior research has shown that poorly implemented blocklists can disproportionately remove text from and about minority individuals and exacerbate existing inequalities [27,92,99]. We hope that FilterBuddy's analytic and visualization features to configure more accurate word filters would help minimize such harms.…”

Section: Limitations and Future Work

mentioning

confidence: 97%

“…This emphasizes the importance of offering technical resources-such as carefully curated lexicon-in addressing online hate. Recent work has shown how widely-adopted word filter lists such as the List of Dirty, Naughty, Obscene and Otherwise Bad Words (LDNOOBW) can harm marginalized groups, such as by censoring terms related to LGBTQ topics, due to the lack of input from members of those groups [27]. There is an opportunity here to involve minority support groups and use their domain expertise and influence to curate and publicize appropriate lexicons.…”

Section: Third-party Organizations and Advocacy Groups

mentioning

confidence: 99%

Designing Word Filter Tools for Creator-led Comment Moderation

Jhaver, Chen, Knauss, et al. 2022

Preprint

Online social platforms centered around content creators often allow comments on content, where creators moderate the comments they receive. As creators can face overwhelming numbers of comments, with some of them harassing or hateful, platforms typically provide tools such as word filters for creators to automate aspects of moderation. From needfinding interviews with 19 creators about how they use existing tools, we found that they struggled with writing good filters as well as organizing and revisiting their filters, due to the difficulty of determining what the filters actually catch. To address these issues, we present FilterBuddy, a system that supports creators in authoring new filters or building from existing filter lists, as well as organizing their filters and visualizing what comments are captured over time. We conducted an early-stage evaluation of FilterBuddy with YouTube creators, finding that participants see FilterBuddy not just as a moderation tool, but also a means to organize their comments to better understand their audiences. CCS Concepts: • Human-centered computing → Empirical studies in collaborative and social computing.

“…Many NLG datasets are similarly built on top of web-scrapes (e.g., news websites for summarization datasets or Wikipedia for data-to-text datasets) and often do not contain significant post-editing steps. As a result of this, pretraining examples can be found in downstream test corpora (Dodge et al, 2021;. Since it is impossible to remove the affected data from the training corpus after the release of a model, multiple approaches have been explored mitigation techniques.…”

Section: Representation In Performance Numbers

mentioning

confidence: 99%

“…We point to Paullada et al (2020) for a more in-depth survey of general issues in data creation, including those of benchmarking and data maintenance practices, to Bender et al (2021) for a survey issues of using large web-scraped datasets, and to Luccioni and Viviano (2021) and Dodge et al (2021) for analyses of such large-scale web-scraped corpora and their representational, legal, consent, and PII issues.…”

mentioning

confidence: 99%

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

Gehrmann, Clark, Sellam 2022

Preprint

Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.

“…The documentation and curation of datasets have become a very active research area, and along with it, the detection of inappropriate material contained in datasets and reflected by deep models. Dodge et al [14] documented the very large C4 corpus with features such as 'text source' and 'content', arguing for different levels of documentation. They also address how C4 was created and show that this process removed texts from and about minorities.…”

Section: Issues Arising From Large Datasets

mentioning

confidence: 99%

Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?

Schramowski, Tauchmann, Kersting 2022

Preprint

This paper contains images and descriptions that are offensive in nature. Large datasets underlying much of current machine learning raise serious issues concerning inappropriate content such as offensive, insulting, threatening, or might otherwise cause anxiety. This calls for increased dataset documentation, e.g., using datasheets. They, among other topics, encourage to reflect on the composition of the datasets. So far, this documentation, however, is done manually and therefore can be tedious and error-prone, especially for large image datasets. Here we ask the arguably "circular" question of whether a machine can help us reflect on inappropriate content, answering Question 16 in Datasheets. To this end, we propose to use the information stored in pre-trained transformer models to assist us in the documentation process. Specifically, prompt-tuning based on a dataset of socio-moral values steers CLIP to identify potentially inappropriate content, therefore reducing human labor. We then document the inappropriate images found using word clouds, based on captions generated using a vision-language model. The documentations of two popular, large-scale computer vision datasets (ImageNet and OpenImages) produced this way suggest that machines can indeed help dataset creators to answer Question 16 on inappropriate image content.
