2020
DOI: 10.46298/jdmdh.5581

Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Abstract: Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, epigraphy from Antiquity, or medieval manuscripts, (1) such markers are mostly absent, and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters fo…
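The abstract describes applying a convolutional encoder to characters, followed by a linear classifier that labels each character as a word boundary or part of an in-word sequence. A minimal sketch of that idea in PyTorch follows; all hyperparameters (vocabulary size, embedding and channel dimensions, kernel size) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: convolutional character encoder + per-character linear
# classifier for word segmentation. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ConvSegmenter(nn.Module):
    def __init__(self, vocab_size=128, emb_dim=64, channels=128, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Same-padded 1D convolution, so every character keeps a prediction.
        self.conv = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        # Two classes per character: 0 = in-word, 1 = word boundary.
        self.classify = nn.Linear(channels, 2)

    def forward(self, char_ids):           # char_ids: (batch, seq_len)
        x = self.embed(char_ids)           # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2))   # (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)  # (batch, seq_len, channels)
        return self.classify(x)            # (batch, seq_len, 2) logits

# Usage: per-character boundary logits for an unsegmented string.
model = ConvSegmenter()
ids = torch.randint(0, 128, (1, 40))       # stand-in for encoded characters
boundaries = model(ids).argmax(dim=-1)     # 0/1 label per character
```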

Cited by 5 publications (5 citation statements)
References 6 publications (6 reference statements)
“…It must be stressed that spaces are the most important source of error in medieval HTR models 21: for the model Bicerin (Pinche [2021a]), spaces represent 33.9% of errors 22. In the current state of the art of HTR, some workflows (Camps et al. [2021, 2020]) chose to solve this problem with a secondary tool such as Boudams (Clérice [2019]), a deep learning tool built for word segmentation in Latin or Medieval French. Of these, the microfilmed manuscripts (see Table 2), all dating from the end of the 13th century or the 14th century and written in Old French, are kept for evaluating performance as our test dataset 23.…”
Section: Dataset
confidence: 99%
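The workflow this citation describes, running a segmenter such as Boudams over HTR output as a secondary post-correction step, ultimately reduces to re-inserting spaces according to per-character boundary predictions. A hedged sketch of that decoding step is below, with hand-written labels standing in for model output; this is a generic illustration, not Boudams' actual API.

```python
# Rebuild spacing in an HTR line from per-character boundary predictions.
def resegment(line: str, boundary_after: list[bool]) -> str:
    """Insert a space after each character flagged as a word boundary."""
    chars = [c for c in line if c != " "]   # drop the HTR model's own spacing
    assert len(chars) == len(boundary_after)
    out = []
    for char, is_boundary in zip(chars, boundary_after):
        out.append(char)
        if is_boundary:
            out.append(" ")
    return "".join(out).rstrip()

# Hand-written labels standing in for a segmenter's predictions:
labels = [False, True, False, False, False, True, False, True,
          False, False, False, True]
print(resegment("ladamefubele", labels))  # -> "la dame fu bele"
```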
“…For the Latin inscriptions, transcribed by Astori in upper-case letters, we used the model Modèle imprimé 16-18e Fra+Lat. The model comes from the CREMMA project and combines French and Latin training data (such as that in this repository: [Clérice, 2021]).…”
Section: Datasets and Models
confidence: 99%
“…Following the competitions organized in recent years, notably at ICFHR and ICDAR, several robust architectures for layout analysis of historical documents have been developed [8], whose application to non-Latin-script documents provides equivalent results [10,14]. HTR architectures specialized in a type of document or in a single hand also achieve very high recognition scores, even though the literature is mostly Latin-script based, as do the proven pipelines composed of character-level HTR and post-processing [7]. Non-Latin, cursive, and right-to-left writings, like the Arabic scripts, remain an open problem in digital humanities, with a wide variety of approaches [11].…”
Section: Introduction
confidence: 99%
“…Therefore, the Maghrebi scripts constitute a family of rounded scripts that share a number of characteristics, above all very rounded loops, which can be seen in the manuscripts in the present dataset (see infra 2.3). The main characteristics 7 of the scripts are displayed in Table 1.…”
Section: Introduction
confidence: 99%