2021
DOI: 10.46298/jdmdh.6485
Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Abstract: This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up…
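The abstract mentions a CRF tagger for POS-tagging. As a purely illustrative sketch (not the paper's actual feature set), CRF taggers are typically trained on per-token feature dictionaries like the following; the feature names and the example sentence are assumptions for illustration only:

```python
def token_features(sentence, i):
    """Surface features for the token at position i, of the kind a CRF
    POS tagger is commonly trained on (illustrative, not from the paper)."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],           # suffixes help with French verb endings
        "prefix2": word[:2],
        "BOS": i == 0,                  # beginning of sentence
        "EOS": i == len(sentence) - 1,  # end of sentence
    }
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    return features

# Hypothetical example sentence:
sentence = ["Je", "ris", "de", "me", "voir", "si", "belle"]
feats = [token_features(sentence, i) for i in range(len(sentence))]
print(feats[1]["suffix3"])  # → "ris"
```

A CRF learns weights over such features jointly with label-transition features, which is what distinguishes it from a per-token classifier.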

Cited by 1 publication (2 citation statements); references 14 publications.
“…The POS-annotated data is a mixture of two different sources. On the one hand, there is the CornMol corpus (Camps et al., 2020), made up of normalised 17th-c. French comedies. On the other hand, there is a gold subset of the Presto corpus (Blumenthal et al., 2017), made up of texts of different genres written during the 16th, 17th and 18th c., which had previously been used to train annotation tools (Diwersy et al., 2017), and which we heavily corrected to match our annotation principles (Gabay et al., 2020).…”
Section: FreEM LPM
confidence: 99%
“…Libraries, archives and museums, among others, are digitising large numbers of historical sources, from which high-quality data must be extracted for further study by specialists of the human sciences following new approaches such as "distant reading" (Moretti, 2013). Many (sub)tasks, such as automatic OCR post-correction (Rijhwani et al., 2021) and linguistic annotation (Camps et al., 2020), benefit from pretrained language models to improve their accuracy, and this is what motivated us to develop a BERT-like (Devlin et al., 2019) contextualised language model for Early Modern French. Languages evolve over time on many different levels: from one century to another, we see variations in spelling, syntax, the lexicon, etc.…”
Section: Introduction
confidence: 99%