2020
DOI: 10.46298/jdmdh.5581

Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Abstract: Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, epigraphy from Antiquity, or medieval manuscripts, (1) such markers are mostly absent, and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters fo…
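The abstract describes applying a convolutional encoder to characters, followed by a linear classifier that labels each character as a word boundary or part of an in-word sequence. A minimal sketch of that idea in PyTorch follows; all hyperparameters (vocabulary size, embedding and channel dimensions, kernel size) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: convolutional character encoder + per-character linear
# classifier for word segmentation. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ConvSegmenter(nn.Module):
    def __init__(self, vocab_size=128, emb_dim=64, channels=128, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Same-padded 1D convolution, so every character keeps a prediction.
        self.conv = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        # Two classes per character: 0 = in-word, 1 = word boundary.
        self.classify = nn.Linear(channels, 2)

    def forward(self, char_ids):           # char_ids: (batch, seq_len)
        x = self.embed(char_ids)           # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2))   # (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)  # (batch, seq_len, channels)
        return self.classify(x)            # (batch, seq_len, 2) logits

# Usage: per-character boundary logits for an unsegmented string.
model = ConvSegmenter()
ids = torch.randint(0, 128, (1, 40))       # stand-in for encoded characters
boundaries = model(ids).argmax(dim=-1)     # 0/1 label per character
```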

Cited by 5 publications (5 citation statements)
References 6 publications (6 reference statements)
“…It must be stressed that spaces are the most important source of error in medieval HTR models 21: for the model Bicerin (Pinche [2021a]), spaces represent 33.9% of errors 22. In the current state of the art of HTR, some workflows (Camps et al. [2021, 2020]) chose to solve this problem with a secondary tool such as Boudams (Clérice [2019]), a deep learning tool built for word segmentation in Latin or Medieval French. Of these, the microfilmed manuscripts (see Table 2), all dating from the end of the 13th century or the 14th century and written in Old French, are kept for evaluating performance as our test dataset 23.…”
Section: Dataset
confidence: 99%
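The workflow this citation describes, running a segmenter such as Boudams over HTR output as a secondary post-correction step, ultimately reduces to re-inserting spaces according to per-character boundary predictions. A hedged sketch of that decoding step is below, with hand-written labels standing in for model output; this is a generic illustration, not Boudams' actual API.

```python
# Rebuild spacing in an HTR line from per-character boundary predictions.
def resegment(line: str, boundary_after: list[bool]) -> str:
    """Insert a space after each character flagged as a word boundary."""
    chars = [c for c in line if c != " "]   # drop the HTR model's own spacing
    assert len(chars) == len(boundary_after)
    out = []
    for char, is_boundary in zip(chars, boundary_after):
        out.append(char)
        if is_boundary:
            out.append(" ")
    return "".join(out).rstrip()

# Hand-written labels standing in for a segmenter's predictions:
labels = [False, True, False, False, False, True, False, True,
          False, False, False, True]
print(resegment("ladamefubele", labels))  # -> "la dame fu bele"
```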
“…For the Latin inscriptions, transcribed by Astori in upper-case letters, we used the model Modèle imprimé 16-18e Fra+Lat. The model comes from the CREMMA project and combines French and Latin training data (such as that in this repository: [Clérice, 2021]).…”
Section: Datasets and Models
confidence: 99%
“…Following the competitions organized in recent years, notably at ICFHR and ICDAR, several robust architectures for layout analysis of historical documents have been developed [8], whose application to non-Latin-script documents provides equivalent results [10,14]. HTR architectures specialized in a type of document or in a single hand also achieve very high recognition scores, even though the literature is mostly Latin-script based, as do the proven pipelines composed of character-level HTR and post-processing [7]. Non-Latin, cursive, and right-to-left writings, like the Arabic scripts, remain an open problem in digital humanities, with a wide variety of approaches [11].…”
Section: Introduction
confidence: 99%
“…Therefore, the Maghrebi scripts constitute a family of rounded scripts that share a number of characteristics, above all very rounded loops, which can be seen in the manuscripts in the present dataset (see infra 2.3). The main characteristics 7 of the scripts are displayed in Table 1.…”
Section: Introduction
confidence: 99%