Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021
DOI: 10.18653/v1/2021.emnlp-main.98

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

citations
Cited by 79 publications
(54 citation statements)
references
References 39 publications
“…Yet, we expect that this tool will introduce friction [16] to successfully posting inappropriate posts and act as a powerful deterrent against bad actors, thereby delimiting their submission of such posts. Prior research has shown that poorly implemented blocklists can disproportionately remove text from and about minority individuals and exacerbate existing inequalities [27,92,99]. We hope that FilterBuddy's analytic and visualization features to configure more accurate word filters would help minimize such harms.…”
Section: Limitations and Future Work
mentioning
confidence: 97%
“…This emphasizes the importance of offering technical resources, such as carefully curated lexicon, in addressing online hate. Recent work has shown how widely-adopted word filter lists such as the List of Dirty, Naughty, Obscene and Otherwise Bad Words (LDNOOBW) can harm marginalized groups, such as by censoring terms related to LGBTQ topics, due to the lack of input from members of those groups [27]. There is an opportunity here to involve minority support groups and use their domain expertise and influence to curate and publicize appropriate lexicons.…”
Section: Third-party Organizations and Advocacy Groups
mentioning
confidence: 99%
“…Many NLG datasets are similarly built on top of web-scrapes (e.g., news websites for summarization datasets or Wikipedia for data-to-text datasets) and often do not contain significant post-editing steps. As a result of this, pretraining examples can be found in downstream test corpora (Dodge et al, 2021;. Since it is impossible to remove the affected data from the training corpus after the release of a model, multiple approaches have been explored mitigation techniques.…”
Section: Representation In Performance Numbers
mentioning
confidence: 99%
“…We point to Paullada et al (2020) for a more in-depth survey of general issues in data creation, including those of benchmarking and data maintenance practices, to Bender et al (2021) for a survey issues of using large web-scraped datasets, and to Luccioni and Viviano (2021) and Dodge et al (2021) for analyses of such large-scale web-scraped corpora and their representational, legal, consent, and PII issues.…”
mentioning
confidence: 99%
“…The documentation and curation of datasets have become a very active research area, and along with it, the detection of inappropriate material contained in datasets and reflected by deep models. Dodge et al [14] documented the very large C4 corpus with features such as 'text source' and 'content', arguing for different levels of documentation. They also address how C4 was created and show that this process removed texts from and about minorities.…”
Section: Issues Arising From Large Datasets
mentioning
confidence: 99%