Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

Líu, Hao; Wang, Bin; Bao, Zhimin; Xue, Mobai; Kang, Sheng; Jiang, Deqiang; Liu, Yinsong; Ren, Bo

doi:10.1609/aaai.v36i2.20062

Cited by 24 publications

(17 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our small model achieves the best performance over the previous counterparts with the similar model size. Specifically, our method achieves better accuracy than PerSec [30] which pretrained with 100 million real data, while we only use 4.2 million real data for pretraining. ABINet [12] and its extension ConCLR [54], performing similarly to our approach with small ViT, use an explicit pretrained masked language model to correct the predicted result iteratively, which is complementary to our approach and may benefit our approach.…”

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 99%

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Lyu¹,

Zhang²,

Liu³

et al. 2022

Preprint

View full text Add to dashboard Cite

In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts the patch-level representations, and the decoder recognizes the text from the representations. Our approach pretrains both the encoder and the decoder in a sequential manner. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which shows the effectiveness for general images, expecting that the representations take on semantics. (ii) We pretrain the decoder over a large set of synthesized text images in a supervised manner and enhance the language modeling capability of the decoder by randomly masking some text image patches occupied by characters input to the encoder and accordingly the representations input to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on the benchmark datasets, including Chinese and English text images.

show abstract

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 99%

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Lyu¹,

Zhang²,

Liu³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…The proposed framework was effective with both CTC decoders and attention decoders. PerSec [34] also introduced contrastive learning for text recognition. It presented hierarchical contrastive learning that drove each element of features at high and low levels.…”

Section: Self-supervised Text Recognitionmentioning

confidence: 99%

“…Data augmentation For effective contrastive learning, the data augmentation plays an essential role in the whole framework. Following [34], besides the augmentation operations used in SeqCLR, more image augmentations including color jitter and gray are employed to enhance the representation quality of the pre-trained model.…”

Section: Contrastive Learningmentioning

confidence: 99%

“…Handwritten Text Recognition Benchmarks We verify our text recognition models on two handwritten text recognition benchmarks: CVL [26] and IAM [38]. Following SeqCLR [1] and Per-Sec [34] [36] with a cosine learning rate scheduler, and train the models for 3 epochs. The training hyperparameters are: the batch size as 1,024, base learning rate as 1.5e-4, weight decay as 0.05, 𝛽 1 = 0.9, 𝛽 2 = 0.95, warm-up for 5,000 steps.…”

Section: Text Image Super-resolutionmentioning

confidence: 99%

“…SeqCLR [1] proposed a sequence-to-sequence contrastive learning framework for text recognition. PerSec [34] introduced a hierarchical contrastive learning method for text recognition. They are both based on contrastive learning and mainly focus on learning the discrimination of text images, as shown in Fig.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition

Yang¹,

Liao²,

Pu³

et al. 2022

Preprint

View full text Add to dashboard Cite

Existing text recognition methods usually need large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between the synthetic data and real data, which limits the performance of the text recognition models. Recent self-supervised text recognition methods attempted to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of the text images. Inspired by the observation that humans learn to recognize the texts through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, which imitates the reading behavior of humans. Meanwhile, masked image modeling is firstly introduced for text recognition to learn the context generation of the text images, which is similar to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by averagely 5.3% on 11 benchmarks, with similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with obvious performance gain.

show abstract