2021
DOI: 10.46298/jdmdh.6485
Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Abstract: This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up…
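The abstract mentions a CRF tagger for POS-tagging. As a purely illustrative sketch (not the paper's actual feature set), CRF taggers are typically trained on per-token feature dictionaries like the following; the feature names and the example sentence are assumptions for illustration only:

```python
def token_features(sentence, i):
    """Surface features for the token at position i, of the kind a CRF
    POS tagger is commonly trained on (illustrative, not from the paper)."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],           # suffixes help with French verb endings
        "prefix2": word[:2],
        "BOS": i == 0,                  # beginning of sentence
        "EOS": i == len(sentence) - 1,  # end of sentence
    }
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    return features

# Hypothetical example sentence:
sentence = ["Je", "ris", "de", "me", "voir", "si", "belle"]
feats = [token_features(sentence, i) for i in range(len(sentence))]
print(feats[1]["suffix3"])  # → "ris"
```

A CRF learns weights over such features jointly with label-transition features, which is what distinguishes it from a per-token classifier.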

Cited by 1 publication (2 citation statements); references 14 publications.
“…The POS-annotated data is a mixture of two different sources. On the one hand, there is the CornMol corpus (Camps et al., 2020), made up of normalised 17th-c. French comedies. On the other hand, there is a gold subset of the Presto corpus (Blumenthal et al., 2017), made up of texts of different genres written during the 16th, 17th and 18th c., which had previously been used to train annotation tools (Diwersy et al., 2017), and which we heavily corrected to match our annotation principles (Gabay et al., 2020).…”
Section: FreEM LPM
confidence: 99%
“…Libraries, archives and museums, among others, are digitising large numbers of historical sources, from which high-quality data must be extracted for further study by specialists of the human sciences following new approaches such as "distant reading" (Moretti, 2013). Many (sub)tasks, such as automatic OCR post-correction (Rijhwani et al., 2021) and linguistic annotation (Camps et al., 2020), benefit from pretrained language models to improve their accuracy, and this is what motivated us to develop a BERT-like (Devlin et al., 2019) contextualised language model for Early Modern French. Languages evolve over time on many different levels: from one century to another, we see variations in spelling, syntax, the lexicon, etc.…”
Section: Introduction
confidence: 99%