2018 25th IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip.2018.8451273
Dense Chained Attention Network for Scene Text Recognition

Cited by 10 publications (15 citation statements)
References 13 publications
“…Comparison with Methods Based on the Traditional FC-LSTM: As previously introduced, the traditional FC-LSTM is widely used in existing recognizers. Among the methods listed in Table 1, RARE [8], AON [6] and FAN [5] combined FC-LSTM with the attention mechanism in a fully connected way when performing sequential transcription, while CRNN [7], R²AM [17], Gao's model [4] and SqueezedText [20] utilized FC-LSTM for frame-level prediction, sequential feature encoding or other purposes. As shown in Table 1, our proposed FACLSTM outperforms these FC-LSTM-based methods by large margins on both the regular-text dataset IIIT5K (90.5% vs. 87.4%) and the curved-text dataset CUTE (83.33% vs. 76.8%) when no lexicon is used.…”
Section: Results
confidence: 99%
“…Methods Based on LSTM: LSTM is widely used in existing state-of-the-art recognizers for three purposes, i.e., producing the frame-level predictions required by a subsequent sequential transcription module [4,7], encoding sequential features while taking historical information into account [8,16], and directly generating sequential predictions in cooperation with the attention mechanism [5,6,13,16,17]. For example, the CRNN proposed by Shi et al. [7] was composed of three parts: a convolution module used to extract features from input images, a bi-LSTM layer built to make predictions for individual frames, and a CTC-based sequential transcription component utilized to infer sequential outputs from the frame-level predictions.…”
Section: Related Work
confidence: 99%
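The quoted passage describes how a CTC-based transcription component turns per-frame predictions from the bi-LSTM into a final text string. A minimal sketch of that last step is CTC greedy decoding: collapse consecutive repeated labels, then drop the blank symbol. The frame labels and blank symbol below are made-up illustrations, not data from the paper.

```python
BLANK = "-"  # hypothetical CTC blank symbol

def ctc_greedy_decode(frame_labels):
    """Infer a sequential output from frame-level predictions:
    collapse consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels from a bi-LSTM over 10 image frames:
frames = ["-", "t", "t", "-", "e", "x", "x", "-", "t", "t"]
print(ctc_greedy_decode(frames))  # -> "text"
```

Note that the repeated "t"/"x" frames collapse to single characters, while the blank between the two "t" runs is what allows a genuinely doubled letter to survive decoding.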