2022
DOI: 10.48550/arxiv.2207.00220
Preprint

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Abstract: One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take into account context. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a ∼25…
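
The release described in the abstract is distributed through the Hugging Face Hub, so the most practical way to inspect it is to stream a single subset rather than download the full corpus. The following is a minimal sketch; the dataset id "pile-of-law/pile-of-law", the subset name "r_legaladvice", and the "text" field are assumptions about the published configuration and may need adjusting.

```python
# Minimal sketch: stream one subset of the Pile of Law corpus with the
# Hugging Face `datasets` library instead of downloading the full release.
# The dataset id, subset name, and field name below are assumptions.
from datasets import load_dataset

dataset = load_dataset(
    "pile-of-law/pile-of-law",   # assumed Hub dataset id
    "r_legaladvice",             # assumed subset/config name
    split="train",
    streaming=True,              # iterate lazily; no full download
)

# Peek at a few documents without materializing the whole split.
for i, example in enumerate(dataset):
    print(example["text"][:200])  # "text" is the assumed document field
    if i >= 2:
        break
```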

Cited by 4 publications (3 citation statements) | References 59 publications (100 reference statements)
“…However, because legal text is highly distinctive and requires handling the issues described in Section 1, there is significant scope for improvement over these models, which are pre-trained on the general domain. There have been efforts to pre-train transformers on the legal domain: (i) Chalkidis et al. (2020) pre-trained BERT-base on EU and UK legislation and on court documents from the US, the European Court of Justice (ECJ) and the European Court of Human Rights (ECtHR), releasing the LegalBERT model; (ii) Zheng et al. (2021) proposed CaseLaw-BERT, pre-trained on a corpus of US case-law documents and contracts; (iii) Henderson et al. (2022) prepared a large corpus of US, Canadian and EU documents (not just case law), called the Pile of Law, and trained BERT-large on it to yield the PoLBERT model; (iv) Xiao et al. (2021) released Lawformer, a Longformer-based (Beltagy et al., 2020) model pre-trained on Chinese legal text. The details of the pre-training datasets are available in Table 1…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
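
The encoders surveyed in the citation above are published as standard Hugging Face checkpoints, so they can be loaded for downstream legal NLP tasks through the usual AutoModel/AutoTokenizer interface. A minimal sketch follows, assuming the public LegalBERT id "nlpaueb/legal-bert-base-uncased"; other checkpoints (CaseLaw-BERT, PoLBERT, Lawformer) can be swapped in by name if their ids are known.

```python
# Minimal sketch: load a domain-pretrained legal encoder and extract a
# document embedding. The checkpoint name is the published LegalBERT id;
# treat it as an assumption if your environment pins different models.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "nlpaueb/legal-bert-base-uncased"  # LegalBERT (Chalkidis et al., 2020)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "The court granted the defendant's motion for summary judgment."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# [CLS]-token representation, usable as a sentence/document embedding.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for a BERT-base encoder
```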
“…The latter two models are based on the same architecture as BERT-base. For the sake of a fair comparison, we did not choose PoLBERT (Henderson et al., 2022) as a baseline, since it is based on BERT-large, which is inherently more powerful.…”
Section: Application on End-Tasks | Citation type: mentioning | Confidence: 99%
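
The "inherently more powerful" point comes down to capacity: BERT-large has roughly three times as many parameters as BERT-base. A small sketch, assuming the standard architecture hyperparameters rather than any particular released checkpoint, that instantiates both configurations locally and compares parameter counts:

```python
# Minimal sketch of the capacity gap cited when excluding a BERT-large
# baseline: build untrained BERT-base and BERT-large configurations and
# compare parameter counts. Exact counts vary slightly with vocabulary size.
from transformers import BertConfig, BertModel

base = BertModel(BertConfig())  # defaults: 12 layers, hidden 768, 12 heads
large = BertModel(BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
))

print(f"BERT-base : {base.num_parameters() / 1e6:.0f}M parameters")
print(f"BERT-large: {large.num_parameters() / 1e6:.0f}M parameters")
# Roughly 110M vs. 340M parameters, which is why comparing a BERT-base
# model against a BERT-large one is not an apples-to-apples evaluation.
```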