Transkribus. A Platform for Automated Text Recognition and Searching of Historical Documents

Colutto, Sebastian; Kahle, Philip; Hackl, Guenter; Muehlberger, Guenter

doi:10.1109/escience.2019.00060

Cited by 18 publications

(8 citation statements)

References 2 publications

“…60,61 In this way, OCR software can help to improve quality regarding the use of machine learning-based neural networks, as well as the adoption of postcorrection tools. 62,63 In some cases, there is no option to retrieve the datasets by means of an API, hindering the reuse of the digital collections locked inside siloed repositories. In addition, institutions publish the information as PDF files instead of plain text files amenable to computational use.…”

Section: Discussion (mentioning)

confidence: 99%

A benchmark of Spanish language datasets for computationally driven research

Candela

Saez

2021

Journal of Information Science


In the domain of Galleries, Libraries, Archives and Museums (GLAM) institutions, creative and innovative tools and methodologies for content delivery and user engagement have recently gained international attention. New methods have been proposed to publish digital collections as datasets amenable to computational use. Standardised benchmarks can be useful to broaden the scope of machine-actionable collections and to promote cultural and linguistic diversity. In this article, we propose a methodology to select datasets for computationally driven research applied to Spanish text corpora. This work seeks to encourage Spanish and Latin American institutions to publish machine-actionable collections based on best practices and avoiding common mistakes.


“…Different tools are available for carrying out manuscript transcription, as for example Aletheia [34], a ground truthing tool, and Transkribus [35], a platform for the digitization, transcription, recognition and searching of historical documents. Usually, most of the tools adopt an architecture as the one shown in Figure 1: a collection of documents, the data set DS, is manually transcribed and the annotated word images are included in the training set.…”

Section: Methods (mentioning)

confidence: 99%

One Step Is Not Enough: A Multi-Step Procedure for Building the Training Set of a Query by String Keyword Spotting System to Assist the Transcription of Historical Document

2020


Digital libraries offer access to a large number of handwritten historical documents. These documents are available as raw images and therefore their content is not searchable. A fully manual transcription is time-consuming and expensive while a fully automatic transcription is cheaper but not comparable in terms of accuracy. The performance of automatic transcription systems is strictly related to the composition of the training set. We propose a multi-step procedure that exploits a Keyword Spotting system and human validation for building up a training set in a time shorter than the one required by a fully manual procedure. The multi-step procedure was tested on a data set made up of 50 pages extracted from the Bentham collection. The palaeographer that transcribed the data set with the multi-step procedure instead of the fully manual procedure had a time gain of 52.54%. Moreover, a small size training set that allowed the keyword spotting system to show a precision value greater than the recall value was built with the multi-step procedure in a time equal to 35.25% of the time required for annotating the whole data set.


“…There are a few existing commercial products with functions similar to that of the proposed system. Some of them adapt a pre-trained system for Ottoman documents ([23]) while others does not provide a transcription but only Optical Character Recognition (OCR) service ([1, 2]). Furthermore it is impractical to evaluate their performance because of the usage restrictions applied in the free versions.…”

Section: Introduction (mentioning)

confidence: 99%

Transcription of Ottoman Machine-Print Documents

Tasdemir

Kizilirmak

Akcan

et al. 2022

Preprint


With the ever increasing speed of the digitization process, a large collection of Ottoman documents is accessible to researchers and the general public. But, the majority of the users interested in these documents can not read these documents unless they are transcripted to the modern Turkish script which use an extended version of the Latin alphabet. Manual transcription of such a massive amount of documents is beyond the capacity of human experts. As a solution, we propose an automatic recognition system for printed Ottoman documents which transcribes Ottoman texts directly to the modern Turkish script. We evaluated three decoding strategies including the Word Beam Search decoder that allows to use a recognition lexicon and n-gram statistics during the decoding phase. The system achieves 2.25% character error rate and 6.42% word error rate on a test set of 1.4K samples, using the test set transcriptions as the recognition lexicon. Using a general purpose, large lexicon of the Ottoman era (260K words and 77% test coverage), the performance is measured as 3.68% character error rate and 16.61% word error rate.
