2022
DOI: 10.48550/arxiv.2207.00220
Preprint

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Abstract: One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take into account context. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a ∼25…
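
The release described in the abstract is distributed through the Hugging Face Hub, so the most practical way to inspect it is to stream a single subset rather than download the full corpus. The following is a minimal sketch; the dataset id "pile-of-law/pile-of-law", the subset name "r_legaladvice", and the "text" field are assumptions about the published configuration and may need adjusting.

```python
# Minimal sketch: stream one subset of the Pile of Law corpus with the
# Hugging Face `datasets` library instead of downloading the full release.
# The dataset id, subset name, and field name below are assumptions.
from datasets import load_dataset

dataset = load_dataset(
    "pile-of-law/pile-of-law",   # assumed Hub dataset id
    "r_legaladvice",             # assumed subset/config name
    split="train",
    streaming=True,              # iterate lazily; no full download
)

# Peek at a few documents without materializing the whole split.
for i, example in enumerate(dataset):
    print(example["text"][:200])  # "text" is the assumed document field
    if i >= 2:
        break
```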

Cited by 4 publications (3 citation statements) | References 59 publications (100 reference statements)
“…However, because legal text is highly distinctive and requires handling the issues described in Section 1, there is significant scope for improvement over these models, which are pre-trained on the general domain. There have been efforts to pre-train transformers on the legal domain: (i) Chalkidis et al. (2020) pre-trained BERT-base on EU and UK legislation and on court documents from the US, the European Court of Justice (ECJ) and the European Court of Human Rights (ECtHR), releasing the LegalBERT model; (ii) Zheng et al. (2021) proposed CaseLaw-BERT, pre-trained on a corpus of US case-law documents and contracts; (iii) Henderson et al. (2022) prepared a large corpus of US, Canadian and EU documents (not just case law), called the Pile of Law, and trained BERT-large on it to yield the PoLBERT model; (iv) Xiao et al. (2021) released Lawformer, a Longformer-based (Beltagy et al., 2020) model pre-trained on Chinese legal text. The details of the pre-training datasets are available in Table 1…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
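
The encoders surveyed in the citation above are published as standard Hugging Face checkpoints, so they can be loaded for downstream legal NLP tasks through the usual AutoModel/AutoTokenizer interface. A minimal sketch follows, assuming the public LegalBERT id "nlpaueb/legal-bert-base-uncased"; other checkpoints (CaseLaw-BERT, PoLBERT, Lawformer) can be swapped in by name if their ids are known.

```python
# Minimal sketch: load a domain-pretrained legal encoder and extract a
# document embedding. The checkpoint name is the published LegalBERT id;
# treat it as an assumption if your environment pins different models.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "nlpaueb/legal-bert-base-uncased"  # LegalBERT (Chalkidis et al., 2020)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "The court granted the defendant's motion for summary judgment."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# [CLS]-token representation, usable as a sentence/document embedding.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for a BERT-base encoder
```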
“…The latter two models are based on the same architecture as BERT-base. For the sake of a fair comparison, we did not choose PoLBERT (Henderson et al., 2022) as a baseline, since it is based on BERT-large, which is inherently more powerful.…”
Section: Application on End-Tasks | Citation type: mentioning | Confidence: 99%
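
The "inherently more powerful" point comes down to capacity: BERT-large has roughly three times as many parameters as BERT-base. A small sketch, assuming the standard architecture hyperparameters rather than any particular released checkpoint, that instantiates both configurations locally and compares parameter counts:

```python
# Minimal sketch of the capacity gap cited when excluding a BERT-large
# baseline: build untrained BERT-base and BERT-large configurations and
# compare parameter counts. Exact counts vary slightly with vocabulary size.
from transformers import BertConfig, BertModel

base = BertModel(BertConfig())  # defaults: 12 layers, hidden 768, 12 heads
large = BertModel(BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
))

print(f"BERT-base : {base.num_parameters() / 1e6:.0f}M parameters")
print(f"BERT-large: {large.num_parameters() / 1e6:.0f}M parameters")
# Roughly 110M vs. 340M parameters, which is why comparing a BERT-base
# model against a BERT-large one is not an apples-to-apples evaluation.
```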