Proceedings of the Natural Legal Language Processing Workshop 2021
DOI: 10.18653/v1/2021.nllp-1.9

JuriBERT: A Masked-Language Model Adaptation for French Legal Text

Abstract: Language models have proven to be very useful when adapted to specific domains. Nonetheless, little research has been done on adapting domain-specific BERT models to French. In this paper, we focus on creating a language model adapted to French legal text with the goal of helping law professionals. We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data. We explore the use of smaller architectures in domain-specific sub-languages.
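To make the abstract's setting concrete, here is a minimal sketch of querying a domain-adapted masked-language model such as JuriBERT with the Hugging Face transformers fill-mask pipeline. The checkpoint identifier and example sentence are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: querying a French legal masked-language model.
# "path/to/juribert" is a placeholder checkpoint name, not the
# published identifier -- substitute the actual model path.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="path/to/juribert")

# Mask one token in a French legal sentence; a domain-adapted model
# should rank legal vocabulary above generic French completions.
text = f"La Cour de {fill_mask.tokenizer.mask_token} rejette le pourvoi."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```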

Cited by 15 publications (10 citation statements). References 3 publications.
“…Our dataset concerns another type of litigation (habitual residency of children), and we focus on the manual construction of our dataset instead of constructing it automatically. Finally, Douka et al. (2021) introduced JuriBERT, trained on Légifrance, an official website publishing all French law, and evaluated it on topic classification tasks for documents from the Cour de Cassation (the highest court in France).…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…As regards pretrained language models, we used FlauBERT and CamemBERT (Martin et al., 2020), two general-purpose pretrained models for French, as well as JuriBERT (Douka et al., 2021), a language model trained only on data from the legal domain.…”
Section: Models (citation type: mentioning)
confidence: 99%
“…In the legal domain, text classification has an established tradition, both in the monolingual (Šarić et al., 2014; Papaloukas et al., 2021) and in the multilingual setting (Steinberger et al., 2006, 2012; Chalkidis et al., 2019; Avram et al., 2021; Chalkidis et al., 2021). Moreover, the large availability of legal data, produced by national and supranational public institutions, set the stage for the development of domain-adapted models (Chalkidis et al., 2020; Douka et al., 2021; Masala et al., 2021; Licari and Comandè, 2022). As for Italian, a multi-label classification system for bills has been proposed by De Angelis et al. (2022), based on a Bi-GRU architecture using static word embeddings and employing a dataset of 28k legal documents tagged with the TESEO thesaurus.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Pretrained language models (PLMs; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) have seen broad adaptation across various domains such as biology, healthcare (Alsentzer et al., 2019), law (Chalkidis et al., 2020; Douka et al., 2021), software engineering (Tabassum et al., 2020), and social media (Röttger and Pierrehumbert, 2021; Guo et al., 2021). These models benefit from in-domain corpora (e.g., PubMed for the biomedical domain) to learn domain-specific terms and concepts.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
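The domain adaptation these citation statements refer to can be sketched as continued masked-language-model pretraining on an in-domain corpus. The snippet below is a hedged illustration using Hugging Face transformers and datasets; the starting checkpoint, corpus file, and hyperparameters are assumptions for illustration, not the paper's actual recipe (JuriBERT itself may have been pretrained from scratch rather than adapted from a checkpoint).

```python
# Sketch of domain-adaptive MLM pretraining, under assumed names:
# "camembert-base" as a general-purpose French starting point and
# "legal_corpus.txt" as a hypothetical one-document-per-line corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "camembert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the raw in-domain text.
corpus = load_dataset("text", data_files={"train": "legal_corpus.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens: the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-mlm", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```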