2018 13th IAPR International Workshop on Document Analysis Systems (DAS)
DOI: 10.1109/das.2018.35
Learning Deep Representations for Word Spotting under Weak Supervision

Abstract: Convolutional Neural Networks have made their mark in various fields of computer vision in recent years. They have achieved state-of-the-art performance in the field of document analysis as well. However, CNNs require a large amount of annotated training data and, hence, great manual effort. In our approach, we introduce a method to drastically reduce the manual annotation effort while retaining the high performance of a CNN for word spotting in handwritten documents. The model is learned with weak supervision…


Cited by 37 publications (28 citation statements)
References 24 publications
“…One possible augmentation strategy for word spotting is to apply different image transformations, such as shear, rotation, and translation, to the image, as proposed by Sudholt and Fink [25]. Gurjar et al. [11] have shown that pre-training a CNN-based word spotting approach on the synthetic dataset by Krishnan and Jawahar [15] can achieve reasonable word spotting performance even with only a few training samples. Since the improvements achieved by both of these methods are independent of the particular training set, it is likely that the gains from augmentation, pre-training, and sample selection will add up.…”
Section: Related Work
confidence: 99%
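The shear/rotation/translation augmentation described above can be sketched as composing random affine transforms. This is a minimal NumPy illustration; the parameter ranges below are hypothetical assumptions, not the values used by Sudholt and Fink [25]:

```python
import numpy as np

def random_affine_params(rng, max_shear=0.2, max_rot_deg=5.0, max_shift=3.0):
    """Sample a random shear/rotation/translation as one 3x3 homogeneous
    matrix. Parameter ranges here are illustrative assumptions only."""
    shear = rng.uniform(-max_shear, max_shear)
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    shear_m = np.array([[1.0, shear, 0.0],
                        [0.0, 1.0,   0.0],
                        [0.0, 0.0,   1.0]])
    rot_m = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0,            0.0,           1.0]])
    trans_m = np.array([[1.0, 0.0, tx],
                        [0.0, 1.0, ty],
                        [0.0, 0.0, 1.0]])
    # Apply shear first, then rotation, then translation.
    return trans_m @ rot_m @ shear_m

rng = np.random.default_rng(0)
M = random_affine_params(rng)
# The top two rows M[:2] could be passed to an image-warping routine,
# e.g. cv2.warpAffine(word_img, M[:2], (w, h)), to produce one augmented sample.
```

Because the samples are drawn per training image, each epoch effectively sees a new variant of every annotated word, which is the point of augmentation as a substitute for more labeled data.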
“…In this paper, we use a PHOCNet with a temporal pyramid pooling (TPP) layer, as described by Sudholt and Fink [25]. As described by Gurjar et al. [11], we train the PHOCNet using stochastic gradient descent with binary cross-entropy as the loss function when training to predict PHOCs.…”
Section: Training Setup
confidence: 99%
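The loss mentioned above treats each entry of the binary PHOC vector as an independent attribute. A minimal NumPy sketch of per-attribute binary cross-entropy on sigmoid outputs follows; the toy 4-dimensional "PHOC" and the clipping constant are illustrative assumptions (real PHOC vectors have hundreds of dimensions):

```python
import numpy as np

def bce_loss(pred_logits, phoc_targets):
    """Mean binary cross-entropy between sigmoid(logits) and a
    binary PHOC attribute vector."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))      # sigmoid activation
    p = np.clip(p, 1e-7, 1.0 - 1e-7)            # avoid log(0)
    return -np.mean(phoc_targets * np.log(p)
                    + (1.0 - phoc_targets) * np.log(1.0 - p))

# Toy example: a hypothetical 4-d attribute vector and network logits.
phoc_target = np.array([1.0, 0.0, 0.0, 1.0])
logits = np.array([2.0, -1.0, 0.5, 3.0])
loss = bce_loss(logits, phoc_target)
```

In an SGD step this scalar would be backpropagated through the network; deep-learning frameworks provide the same loss with built-in numerical stabilization (e.g. a combined sigmoid-plus-BCE operation).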
“…Even though the attribute CNN approach has shown excellent performance on numerous commonly used academic benchmarks, this comes at the cost of requiring training material. Works such as [5] and [6] try to alleviate the data problem through transfer learning and by incorporating synthetic data, but the need for representative training data remains inherent to any machine-learning-based approach.…”
Section: A. Word Spotting
confidence: 99%
“…The use of convolutional neural networks [23, 24] increased the performance of word spotting systems, but these networks need a training set with a large amount of annotated data to be trained. Many solutions have been proposed for improving word spotting performance without increasing the size of the training set: sample selection [25], data augmentation [23], transfer learning [26, 27], training on synthetic data [22, 28], and relaxed feature matching [29].…”
Section: Introduction
confidence: 99%