Generation of learning samples for historical handwriting recognition using image degradation

Fischer, Andreas; Visani, Muriel; Kieu, van Cuong; Suen, Ching Y.

doi:10.1145/2501115.2501123

Cited by 12 publications

(10 citation statements)

References 23 publications

(28 reference statements)

“…The 1.524 images from the second dataset have been created using the 127 original images and transformed using our 3D distortion model. The tests presented in [39,59] confirm the conclusion of [60] about the impact of the degradation level on re-training, either for a task of character recognition or layout extraction.…”

Section: Document Image Generation For Retraining Task (supporting)

confidence: 80%

“…The DocCreator ability to create synthetic documents that mimic real ones is effective for typewritten and handwritten characters (as long as the characters are apart from one another). Images created with DocCreator have already been used in many DIAR contexts: text/background/image pixel classification [36]; staff removal [13,37,38]; and handwritten character recognition [39]. In this article we present how DocCreator can be useful to enhance a binarization algorithm and for OCR performance prediction.…”

Section: Algorithms For Synthetic Data Augmentation (mentioning)

confidence: 99%

“…This database consists of three sets: the Saint Gall set containing 60 images (1.410 text lines) in Latin, the Parzival set containing 47 images (4.477 text lines) in Medieval German, and the Washington set containing 20 images in English. The authors of [39] used the character degradation model to create two extended databases of the IAM-HistDB. The first one is composed of 17.661 images degraded with the ink model.…”

Section: Document Image Generation For Retraining Task (mentioning)

confidence: 99%

DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images

et al. 2017

Self Cite

Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated, compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often lead to small reliably annotated datasets. To circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform and open-source software able to create many synthetic document images with controlled ground truth. DocCreator has been used in various experiments, demonstrating the value of such synthetic images for enriching the training stage of DIAR tools.

“…Generating synthetic data for the training and evaluation of document image processing systems has been widely addressed in recent years [6,20,21,22,23,24]. In particular, image binarization evaluation is usually computed at pixel level, requiring an accurate groundtruth, with the inner complexity of data supervision at this detail level.…”

Section: Image Processing and Groundtruth Generation (mentioning)

confidence: 99%

The NoisyOffice Database: A Corpus To Train Supervised Machine Learning Filters For Image Processing

Castro-Bleda, España-Boquera, Pastor-Pellicer, et al. 2019

The Computer Journal

This paper presents the ‘NoisyOffice’ database. It consists of images of printed text documents with noise mainly caused by uncleanliness in a generic office, such as coffee stains and footprints on documents or folded and wrinkled sheets with degraded printed text. This corpus is intended to train and evaluate supervised learning methods for cleaning, binarization and enhancement of noisy images of grayscale text documents. As an example, several image enhancement and binarization experiments using deep learning techniques are presented. Double-resolution images are also provided for testing super-resolution methods. The corpus is freely available at the UCI Machine Learning Repository. Finally, a challenge organized by Kaggle Inc. to denoise images using the database is described in order to show its suitability for benchmarking image processing systems.

“…Fischer et al [6] propose a method to generate training samples for historical handwriting recognition. Three degradation models are applied on binary images: Kanungo [3], character degradation from [5] and geometric distortion from the evaluation of [7].…”

Section: Related Work (mentioning)

confidence: 99%

Semi-Synthetic Data Augmentation of Scanned Historical Documents

Karpinski, Belaïd 2019

2019 International Conference on Document Analysis and Recognition (ICDAR)

This paper proposes a new, fully automatic method for generating semi-synthetic images of historical documents to increase the number of training samples in small datasets. This method extracts background-only images (BOI) and text-only images (TOI) from two different sources and mixes them to create semi-synthetic images. The TOIs are extracted with the help of a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing TOI pixels using an inpainting method. Finally, a TOI can be efficiently integrated into a BOI in the gradient domain, thus creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents close to real ones with different backgrounds to highlight the content. Experiments are conducted on the public HisDB dataset, which contains few labeled images. We show that the proposed method improves the results of a semantic segmentation and baseline extraction task.
