2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.543

Focusing Attention: Towards Accurate Text Recognition in Natural Images

Abstract: Scene text recognition has been a hot research topic in computer vision due to its various applications. The state of the art is the attention-based encoder-decoder framework that learns the mapping between input images and output sequences in a purely data-driven way. However, we observe that existing attention-based methods perform poorly on complicated and/or low-quality images. One major reason is that existing methods cannot get accurate alignments between feature areas and targets for such images. We cal…


Cited by 458 publications (440 citation statements); citation types: 1 supporting, 424 mentioning, 0 contrasting.
References 25 publications (33 reference statements).
“…This inevitably makes the models more complicated and difficult to train, requiring significantly longer training times and large amounts of training samples. Recent work, such as [33,4,12], has shown that the performance of RNN-based methods can be improved considerably by introducing a character-level attention mechanism that encodes strong character information implicitly or explicitly. This enables the models to identify characters more accurately and essentially adds constraints that reduce the search space, leading to a performance boost.…”
Section: Character Branch
confidence: 99%
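
The character-level attention idea in the quoted passage can be made concrete with a short sketch. Below is a minimal additive (Bahdanau-style) attention step in PyTorch, assuming a CNN encoder that yields a sequence of feature columns; all module names, dimensions, and the 37-class alphabet are illustrative assumptions, not the implementation used in [33,4,12].

```python
# A minimal sketch (not the cited papers' exact implementation) of one
# character-level attention step: score every encoder feature column
# against the decoder state, read out a weighted "glimpse", and predict
# one character from it. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CharAttentionStep(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=256, attn_dim=256, num_classes=37):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)
        self.classifier = nn.Linear(enc_dim + dec_dim, num_classes)

    def forward(self, enc_feats, dec_state):
        # enc_feats: (B, T, enc_dim) feature columns from the CNN encoder
        # dec_state: (B, dec_dim) current decoder hidden state
        e = self.score(torch.tanh(self.w_enc(enc_feats)
                                  + self.w_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)            # alignment over feature columns
        glimpse = (alpha * enc_feats).sum(dim=1)   # (B, enc_dim) attended feature
        logits = self.classifier(torch.cat([glimpse, dec_state], dim=-1))
        return logits, alpha.squeeze(-1)
```

The per-step alignment weights alpha make the character-to-feature correspondence explicit, which is the additional constraint the quote credits with reducing the search space.
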
“…Among them, the first group contains several well-known recognition networks, including CRNN [1] and GRCNN [5]. We then compare our method with previous attention-aware approaches such as FAN [8], FCN [14], and Baek et al. [16].…”
Section: Configuration
confidence: 99%
“…The attention mechanism is incorporated into the decoder. For example, Lee et al. [7] proposed an attention-based decoder for text-output prediction, while Cheng et al. [8] presented the Focusing Attention Network (FAN) to tackle the attention drift problem and improve the performance of regular text recognition. In addition, some previous work also attempted to handle irregular scene text images at the beginning of the encoder.…”
Section: Introduction
confidence: 99%
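
To illustrate what "attention drift" means here: FAN inspects where the alignment weights actually concentrate at each decoding step and pulls drifted attention back onto the correct character region. Below is a hedged, simplified 1-D sketch of the center-of-attention computation only; FAN's actual focusing network operates on 2-D feature maps and also crops and re-evaluates local features, which this sketch omits.

```python
# A simplified 1-D sketch of locating attention centers per decoding
# step; FAN uses such centers to detect and correct drifted alignments.
# This is only the center computation, not FAN's full focusing network.
import torch

def attention_centers(alpha):
    # alpha: (B, steps, T) attention weights over T feature columns
    positions = torch.arange(alpha.size(-1), dtype=alpha.dtype,
                             device=alpha.device)
    # Expected column index attended at each step: (B, steps)
    return (alpha * positions).sum(dim=-1)
```

A center that lands far from the ground-truth character region signals drift, which is the failure mode the quoted passage says FAN was designed to correct.
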
“…Generally, in the encoding stage, a convolutional neural network (CNN) is used to extract features from the input image, whereas in the decoding stage, the encoded feature vectors are transcribed into target strings by exploiting a recurrent neural network (RNN) [12], [13], connectionist temporal classification (CTC) [14], or an attention mechanism [15]. In particular, attention-based approaches [4], [11], [16], [17], [18] often achieve better performance owing to their focus on informative areas.…”
Section: Introduction
confidence: 99%
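
The CTC transcription route mentioned in the quote can be summarized by its greedy decoding rule: take the argmax label per frame, collapse consecutive repeats, and drop the blank symbol. A minimal sketch follows; the blank index and the shape of the inputs are assumptions for illustration.

```python
# A minimal sketch of CTC greedy decoding: per-frame argmax, collapse
# repeated labels, remove blanks. Blank index is an assumed convention.
import torch

def ctc_greedy_decode(logits, blank=0):
    # logits: (T, num_classes) per-frame class scores from the encoder
    best = logits.argmax(dim=-1).tolist()
    out, prev = [], blank
    for label in best:
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out  # list of label indices for the predicted string
```

Unlike the attention decoder sketched above, CTC commits to a per-frame labeling and needs no explicit alignment step, which is why the quoted passage notes that attention-based approaches often do better on images where informative areas must be singled out.
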