2022
DOI: 10.48550/arxiv.2201.07311
Preprint
Datasheet for the Pile

Abstract: This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

Cited by 5 publications (10 citation statements)
References 28 publications (34 reference statements)
“…For LLMs in particular, the data used to train them are one further step removed from the task-specific models built from them, so the link between data and ML progress is even more abstracted [87,116]. Second, research addressing dataset choices, creation, and curation is systematically "under-valued and de-glamorised" [3,123]. Even works that do include significant curation efforts for the sake of improving models [57,113] focus on definitions of quality that prioritize technical performance over the agency of data and algorithm subjects, which can result in widespread data that proliferates misogyny, pornography without consent, and malignant stereotypes [19].…”
Section: Machine Learning Context: Challenges and Incentives
confidence: 99%
“…One approach put forward in recent years to foster more accountability of these data practices has been documentation standards for data and models in natural language processing [11,59] and ML in general [93]. There has also been an increased focus on analyzing other dimensions of data quality and stewardship [102,107,121,123], with several noteworthy initiatives aiming to document both existing [9,20,30,43] and newly developed [16,60,136] resources.…”
Section: Machine Learning Context: Challenges and Incentives
confidence: 99%
“…Common Crawl (http://commoncrawl.org/), meanwhile, has been shown to contain hate speech and over-represent sexually explicit content (Luccioni and Viviano, 2021), and typical web-crawling collection practices have no structures for supporting informed consent beyond websites' own terms and conditions policies that users rarely read (Cakebread, 2017; Obar and Oeldorf-Hirsch, 2020). Several documentation schemas for natural language processing (NLP) datasets (Bender and Friedman, 2018; Gebru et al., 2018; Gebru et al., 2021; Holland et al., 2018; Pushkarna et al., 2021) have been recently produced to aid NLP researchers in documenting their own datasets (Gao et al., 2020; Biderman et al., 2022; Gehrmann et al., 2021; Wang et al., 2021) and even to retrospectively document and analyze datasets that were developed and released by others without thorough documentation (Bandy and Vincent, 2021; Kreutzer et al., 2021; Birhane et al., 2021; Dodge et al., 2021). Data documentation to support transparency has gained traction following calls for a reevaluation of the treatment of data in machine learning (ML) at large (Prabhu and Birhane, 2020; Jo and Gebru, 2020; Paullada et al., 2021; Gebru et al., 2021).…”
Section: Introduction
confidence: 99%