Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition

Li, Hui; Wang, Peng; Shen, Chunhua; Zhang, Guyu

doi:10.1609/aaai.v33i01.33018610

Cited by 338 publications

(316 citation statements)

References 20 publications

Supporting

Mentioning

295

Contrasting

Unclassified

“…Performance on Curved Text: As for curved dataset, we outperforms previous state-of-the-art method using rectification [8] by an absolute improvement of 5% on CUTE. CAP-Net also achieves higher score than the 2D attention baseline [10] by 3.5% on CUTE, while surpassing by 7.2% on IC15 and 2.4% on SVT-P. The superior performance verifies the effectiveness of our method.…”

Section: Methods (mentioning)

confidence: 68%

A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

Guan, Bian, Yao, 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Irregular scene text recognition has attracted much attention from the research community, mainly due to the complexity of shapes of text in natural scene. However, recent methods either rely on shape-sensitive modules such as bounding box regression, or discard sequence learning. To tackle these issues, we propose a pair of coupling modules, termed as Character Anchoring Module (CAM) and Anchor Pooling Module (APM), to extract high-level semantics from two-dimensional space to form feature sequences. The proposed CAM localizes the text in a shape-insensitive way by design by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors which enables sequence learning. The complementary modules realize a harmonic unification of spatial information and sequence learning. With the proposed modules, our recognition system surpasses previous state-of-the-art scores on irregular and perspective text datasets, including ICDAR 2015, CUTE, and Total-Text, while paralleling state-of-the-art performance on regular text datasets.

“…The fact that the polygon prediction is shape-sensitive and may not generalize well to unseen shapes limits the potential of rectification-based methods. Similar problem also exists in 2D attention method [10], which is proven by a less competent score on blurred datasets.…”

Section: Introduction (mentioning)

confidence: 76%

A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

Guan, Bian, Yao, 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

“…Su et al [34,36] converted text images into sequential signals via extracting their HOG features, and designed an ensembling technique to combine the outputs of two LSTM branches, so that better recognition performance could be achieved. Li et al [37] pointed out that traditional attention mechanism was not able to produce accurate attention predictions, thus the recognition performance on irregular text images was largely compromised. To address this issue, they designed a 2-D attention module, where one LSTM was used to encode feature maps column by column to produce holistic features, and another was employed as usual to generate final sequential outputs.…”

Section: Related Work (mentioning)

confidence: 99%

FACLSTM: ConvLSTM with focused attention for scene text recognition

Wang, Jia, He, et al. 2020

Sci. China Inf. Sci.

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, where traditional fully-connected-LSTM (FC-LSTM) has played a critical role. Due to the limitation of FC-LSTM, existing methods have to convert 2-D feature maps into 1-D sequential feature vectors, resulting in severe damages of the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem for its 2-D image inputs, and propose a convolution LSTM (ConvLSTM)-based scene text recognizer, namely, FACLSTM, i.e., Focused Attention ConvLSTM, where the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. Particularly, the attention mechanism is properly incorporated into an efficient ConvLSTM structure via the convolutional operations and additional character center masks are generated to help focus attention on right feature areas. The experimental results on benchmark datasets IIIT5K, SVT and CUTE demonstrate that our proposed FACLSTM performs competitively on the regular, low-resolution and noisy text images, and outperforms the state-of-the-art approaches on the curved text images with large margins.

“…To validate the effectiveness of our method, we evaluate our PRN on several irregular benchmarks and summarize the results in Table 4 [1, 4-7, 9, 15-20, 30, 36-40]. Considering [41] used extra synthetic and real images for training, we did not compare the results with [41] to ensure fairness. As observed in Table 4, our method outperforms other approaches by a large margin on most benchmarks.…”

Section: Performance On Irregular Benchmarks (mentioning)

confidence: 99%

Progressive rectification network for irregular text recognition

Gao, Chen, Wang, et al. 2020

Sci. China Inf. Sci.

Scene text recognition has received increasing attention in the research community. Text in the wild often possesses irregular arrangements, which typically include perspective, curved, and oriented texts. Most of the existing methods do not work well for irregular text, especially for severely distorted text. In this paper, we propose a novel progressive rectification network (PRN) for irregular scene text recognition. Our PRN progressively rectifies the irregular text to a front-horizontal view and further boosts the recognition performance. The distortions are removed step by step by leveraging the observation that the intermediate rectified result provides good guidance for subsequent higher quality rectification. Additionally, by decomposing the rectification process into multiple procedures, the difficulty of each step is considerably mitigated. First, we specifically perform a rough rectification, and then adopt iterative refinement to gradually achieve optimal rectification. Additionally, to avoid the boundary damage problem in direct iterations, we design an envelope-refinement structure to maintain the integrity of the text during the iterative process. Instead of the rectified images, the text line envelope is tracked and continually refined, which implicitly models the transformation information. Then, the original input image is consistently utilized for transformation based on the refined envelope. In this manner, the original character information is preserved until the final transformation. These designs lead to optimal rectification to boost the performance of succeeding recognition. Extensive experiments on eight challenging datasets demonstrate the superiority of our method, especially on irregular benchmarks.
