Attribute CNNs for word spotting in handwritten documents

Sudholt, Sebastian; Fink, Gernot A.

doi:10.1007/s10032-018-0295-0

Cited by 55 publications

(45 citation statements)

References 54 publications


“…Taking inspirations for word attributes, for handwritten images, Poznanski et al [52] adapted VGGNet [72] for recognizing PHOC attributes by having multiple parallel fully connected layers, each one predicting PHOC attributes at a particular level. In similar spirits, different architectures [35,74,76,83] were proposed using CNN networks which embed features into different textual embedding spaces defined by PHOC. In [74], Sudholt et al proposes an architecture to directly embed image features to PHOC attributes by having sigmoid activation in the final layer and thereby avoiding multiple fully connected layers as presented in [52].…”

Section: Deep Learning (mentioning)

confidence: 99%
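
The sigmoid output layer described in this statement can be illustrated with a minimal sketch. It assumes the network produces one raw score (logit) per PHOC attribute and that training uses a binary cross-entropy loss; the loss choice and the function names `predict_attributes` and `binary_cross_entropy` are assumptions made for illustration and are not taken from the quoted text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_attributes(logits):
    """Map one logit per attribute to an independent probability in [0, 1].

    Hypothetical helper: stands in for the final sigmoid layer of an
    attribute CNN, with the convolutional part omitted.
    """
    return [sigmoid(z) for z in logits]


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted and target attributes.

    Assumed training loss; each attribute is treated as its own
    yes/no decision, so several attributes can be active at once.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


if __name__ == "__main__":
    probs = predict_attributes([2.0, -1.5, 0.0])
    print(probs)
    print(binary_cross_entropy(probs, [1, 0, 1]))
```

Because each output is an independent probability, a single layer can predict all attribute levels at once, which is the contrast the statement draws with one fully connected branch per level.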

“…The next set of methods in this space use the principle of attribute embedding framework using deep CNN networks. Here, PHOCNet [74] and TPP-PHOCNet [75,76] uses the output space of CNN as PHOC embedding while Triplet-CNN [83] explores with different embeddings such as PHOC, DCToW and few semantic embeddings. In the table, we report the best performance of Triplet-CNN across different proposed embeddings.…”

Section: Architecture Evaluation (mentioning)

confidence: 99%

“…When the data is limited, fine-tuning a pre-trained network has also been demonstrated to be very effective. In the domain of document images these features have shown better performance for word spotting [37,74,76,83], recognition [52], document classification [26], layout analysis [10], etc. In this work, we propose a deep CNN architecture named as HWNet v2, for the task of learning an efficient word level representation for handwritten documents which can handle multiple writers and, is robust to common forms of degradation and noise.…”

Section: Introduction (mentioning)

confidence: 99%


HWNet v2: an efficient word image representation for handwritten documents

Krishnan

Jawahar

2019

IJDAR


We present a framework for learning an efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with traditional classification loss. The major strengths of our work lie in: (i) the efficient usage of synthetic data to pre-train a deep network, (ii) an adapted version of the ResNet-34 architecture with the region of interest pooling (referred to as HWNet v2) which learns discriminative features for variable sized word images, and (iii) a realistic augmentation of training data with multiple scales and distortions which mimics the natural process of handwriting. We further investigate the process of transfer learning to reduce the domain gap between synthetic and real domain, and also analyze the invariances learned at different layers of the network using visualization techniques proposed in the literature. Our representation leads to a state-of-the-art word spotting performance on standard handwritten datasets and historical manuscripts in different languages with minimal representation size. On the challenging IAM dataset, our method is first to report an mAP of around 0.90 for word spotting with a representation size of just 32 dimensions. Furthermore, we also present results on printed document datasets in English and Indic scripts which validates the generic nature of the proposed framework for learning word image representation.
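
The mAP figure quoted in this abstract is a ranking metric. As a minimal sketch of how mean average precision is commonly computed for retrieval, assuming each query yields a full ranked list marked relevant or not relevant (the function names and the toy data are illustrative and not from the paper):

```python
def average_precision(relevance):
    """Average precision for one ranked retrieval list.

    `relevance` is a sequence of 0/1 flags, best-ranked item first.
    Precision is sampled at each rank where a relevant item appears
    and averaged over the relevant items found in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Two toy queries: the first ranks its only match first,
    # the second ranks its only match second.
    print(mean_average_precision([[1, 0, 0], [0, 1, 0]]))
```

This sketch divides by the number of relevant items present in the ranked list; evaluation protocols differ on details such as whether the query itself is excluded from the list.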

“…e.g. [2], [3]) require a previous segmentation of document pages into individual word images, which is in general not an easy to solve problem. The segmentation-free approach does not pose this requirement, but aims at solving the retrieval and segmentation problem jointly.…”

Section: A Word Spotting (mentioning)

confidence: 99%

“…Our baseline word spotting system is based on the design of [2]. The attribute embedding is a 4-level PHOC representation of partitions 1, 2, 4 and 8 based on the lower case Latin alphabet plus digits.…”

Section: A Word Spotting (mentioning)

confidence: 99%
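
The 4-level PHOC representation mentioned in this statement can be sketched as follows. The alphabet (lower case Latin letters plus digits) and the partitions 1, 2, 4 and 8 come from the quoted text; the rule that a character is assigned to a region when at least half of the character's extent falls inside it is the commonly used PHOC definition and is an assumption here, as is the function name `phoc`.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # 26 letters + 10 digits
LEVELS = (1, 2, 4, 8)                               # partitions per level


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Build a binary pyramidal histogram of characters for `word`.

    Character i of an n-character word occupies [i/n, (i+1)/n] and
    region r of a level with k regions occupies [r/k, (r+1)/k].
    A character sets its bit in a region when the overlap covers at
    least 50% of the character's own extent (assumed rule). Intervals
    are scaled by n*k so the comparison uses exact integers.
    """
    word = word.lower()
    n = len(word)
    vector = []
    for k in levels:
        for r in range(k):
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                idx = alphabet.find(ch)
                if idx < 0:
                    continue  # symbols outside the alphabet are ignored
                # Scaled by n*k: character is [i*k, (i+1)*k], region is [r*n, (r+1)*n].
                overlap = min((i + 1) * k, (r + 1) * n) - max(i * k, r * n)
                if 2 * overlap >= k:  # overlap / character length >= 0.5
                    bits[idx] = 1
            vector.extend(bits)
    return vector


if __name__ == "__main__":
    v = phoc("word")
    print(len(v))  # 36 symbols * (1 + 2 + 4 + 8) regions = 540
```

With this layout the first 36 entries describe the whole word, the next 72 describe its left and right halves, and so on, so two transcriptions can be compared by comparing their vectors.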

Exploring Confidence Measures for Word Spotting in Heterogeneous Datasets

Wolf

Oberdiek

Fink

2019

2019 International Conference on Document Analysis and Recognition (ICDAR)


In recent years, convolutional neural networks (CNNs) took over the field of document analysis and they became the predominant model for word spotting. Especially attribute CNNs, which learn the mapping between a word image and an attribute representation, showed exceptional performances. The drawback of this approach is the overconfidence of neural networks when used out of their training distribution. In this paper, we explore different metrics for quantifying the confidence of a CNN in its predictions, specifically on the retrieval problem of word spotting. With these confidence measures, we limit the inability of a retrieval list to reject certain candidates. We investigate four different approaches that are either based on the network's attribute estimations or make use of a surrogate model. Our approach also aims at answering the question for which part of a dataset the retrieval system gives reliable results. We further show that there exists a direct relation between the proposed confidence measures and the quality of an estimated attribute representation.
