2022
DOI: 10.1609/aaai.v36i2.20062

Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

Abstract: We introduce Perceiving Stroke-Semantic Context (PerSec), a new approach to self-supervised representation learning tailored for the Scene Text Recognition (STR) task. Considering that scene text images carry both visual and semantic properties, we equip PerSec with dual context perceivers that contrast and learn latent representations from low-level stroke and high-level semantic contextual spaces simultaneously, via hierarchical contrastive learning on unlabeled text image data. Experiments in un- and semi-su…
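
The abstract describes dual context perceivers trained with contrastive objectives in a low-level stroke space and a high-level semantic space. As a rough, non-authoritative sketch of what such a two-level objective could look like, the snippet below sums an InfoNCE term over each level; the feature shapes, the flattening of sequence positions into contrast units, and the equal loss weights are assumptions for illustration, not the paper's actual PerSec implementation.

```python
# Minimal sketch of a two-level (hierarchical) InfoNCE objective.
# Feature shapes, position flattening, and loss weights are illustrative
# assumptions, not the released PerSec implementation.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrast matching rows of two views; all other rows act as negatives."""
    z_a = F.normalize(z_a, dim=-1)            # (N, D)
    z_b = F.normalize(z_b, dim=-1)            # (N, D)
    logits = z_a @ z_b.t() / temperature      # (N, N) cosine-similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


def hierarchical_contrastive_loss(low_a, low_b, high_a, high_b,
                                  w_low: float = 1.0, w_high: float = 1.0) -> torch.Tensor:
    """Sum InfoNCE terms over low-level (stroke) and high-level (semantic) features.

    low_*  : (B, T_low, D)  frame-level features from two augmented views
    high_* : (B, T_high, D) sequence-level features from the same two views
    """
    loss_low = info_nce(low_a.flatten(0, 1), low_b.flatten(0, 1))
    loss_high = info_nce(high_a.flatten(0, 1), high_b.flatten(0, 1))
    return w_low * loss_low + w_high * loss_high


if __name__ == "__main__":
    B, T_low, T_high, D = 4, 32, 8, 256
    low_a, low_b = torch.randn(B, T_low, D), torch.randn(B, T_low, D)
    high_a, high_b = torch.randn(B, T_high, D), torch.randn(B, T_high, D)
    print(hierarchical_contrastive_loss(low_a, low_b, high_a, high_b))
```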

Cited by 24 publications (17 citation statements)
References 30 publications

Citation statements, ordered by relevance:
“…Our small model achieves the best performance over previous counterparts of similar model size. Specifically, our method achieves better accuracy than PerSec [30], which is pretrained with 100 million real data, while we only use 4.2 million real data for pretraining. ABINet [12] and its extension ConCLR [54], which perform similarly to our approach with a small ViT, use an explicit pretrained masked language model to iteratively correct the predicted result, which is complementary to and may benefit our approach.…”
Section: Comparison With State-of-the-art Methods
Confidence: 99%
“…The proposed framework was effective with both CTC decoders and attention decoders. PerSec [34] also introduced contrastive learning for text recognition. It presented hierarchical contrastive learning that operates on individual feature elements at both high and low levels.…”
Section: Self-supervised Text Recognition
Confidence: 99%
“…Data augmentation. For effective contrastive learning, data augmentation plays an essential role in the whole framework. Following [34], in addition to the augmentation operations used in SeqCLR, further image augmentations, including color jitter and grayscale, are employed to enhance the representation quality of the pre-trained model.…”
Section: Contrastive Learning
Confidence: 99%
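
The statement above only names the added operations. A minimal two-view augmentation pipeline illustrating color jitter and grayscale on top of SeqCLR-style geometric and blur operations might look like the sketch below; the specific transforms, parameters, and probabilities are illustrative assumptions, not the cited papers' exact settings.

```python
# Illustrative two-view augmentation pipeline for contrastive pre-training.
# The exact operations, parameters, and probabilities are assumptions; the
# cited works add color jitter and grayscale on top of SeqCLR-style augmentations.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # SeqCLR-style distortion (assumed)
    transforms.GaussianBlur(kernel_size=5),                     # SeqCLR-style blur (assumed)
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # added color jitter
    transforms.RandomGrayscale(p=0.2),                          # added grayscale ("gray") augmentation
    transforms.ToTensor(),
])


def two_views(pil_image):
    """Return two independently augmented views of a single text image."""
    return augment(pil_image), augment(pil_image)


if __name__ == "__main__":
    from PIL import Image
    dummy = Image.new("RGB", (128, 32), color="white")  # text-image-sized placeholder
    v1, v2 = two_views(dummy)
    print(v1.shape, v2.shape)                            # torch.Size([3, 32, 128]) each
```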