2022
DOI: 10.48550/arxiv.2201.07311
Preprint
Datasheet for the Pile

Abstract: This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

Cited by 5 publications (10 citation statements)
References 28 publications (34 reference statements)
“…For LLMs in particular, the data used to train them are one further step removed from the task-specific models built from them, so the link between data and ML progress is even more abstracted [87,116]. Second, research addressing dataset choices, creation, and curation is systematically "under-valued and de-glamorised" [3,123]. Even works that do include significant curation efforts for the sake of improving models [57,113] focus on definitions of quality that prioritize technical performance over the agency of data and algorithm subjects, which can result in widespread data that proliferates misogyny, pornography without consent, and malignant stereotypes [19].…”
Section: Machine Learning Context: Challenges and Incentives
confidence: 99%
“…One approach put forward in recent years to foster more accountability of these data practices has been documentation standards for data and models in natural language processing [11,59] and ML in general [93]. There has also been an increased focus on analyzing other dimensions of data quality and stewardship [102,107,121,123], with several noteworthy initiatives aiming to document both existing [9,20,30,43] and newly developed [16,60,136] resources.…”
Section: Machine Learning Context: Challenges and Incentives
confidence: 99%
“…Common Crawl (http://commoncrawl.org/), meanwhile, has been shown to contain hate speech and over-represent sexually explicit content (Luccioni and Viviano, 2021), and typical web-crawling collection practices have no structures for supporting informed consent beyond websites' own terms and conditions policies that users rarely read (Cakebread, 2017; Obar and Oeldorf-Hirsch, 2020). Several documentation schemas for natural language processing (NLP) datasets (Bender and Friedman, 2018; Gebru et al., 2018; Gebru et al., 2021; Holland et al., 2018; Pushkarna et al., 2021) have been recently produced to aid NLP researchers in documenting their own datasets (Gao et al., 2020; Biderman et al., 2022; Gehrmann et al., 2021; Wang et al., 2021) and even to retrospectively document and analyze datasets that were developed and released by others without thorough documentation (Bandy and Vincent, 2021; Kreutzer et al., 2021; Birhane et al., 2021; Dodge et al., 2021). Data documentation to support transparency has gained traction following calls for a reevaluation of the treatment of data in machine learning (ML) at large (Prabhu and Birhane, 2020; Jo and Gebru, 2020; Paullada et al., 2021; Gebru et al., 2021).…”
Section: Introduction
confidence: 99%