From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

Wang, Yuxin; Xie, Hongtao; Fang, Shancheng; Wang, Jing; Zhu, Shenggao; Zhang, Yongdong

doi:10.1109/iccv48922.2021.01393

Cited by 102 publications

(53 citation statements)

References 34 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Also, some works [21,58,82] attempt to construct text recognizers based on Transformer [64], which is thrived in the field of natural language processing, to robustly learn the representations for text images through self-attention modules. Recently, some researchers incorporated semantic knowledge into text recognizers to fully exploit the external language priors [20,54,75,88]. For example, SEED [54] utilizes text embedding guided by fastText [8] to initialize the attention-based decoder.…”

Section: Existing Text Recognition Methodsmentioning

confidence: 99%

Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Chen¹,

Yu²,

Ma³

et al. 2021

Preprint

View full text Add to dashboard Cite

The flourishing blossom of deep learning has witnessed the rapid development of text recognition in recent years. However, the existing text recognition methods are mainly for English texts, whereas ignoring the pivotal role of Chinese texts. As another widely-spoken language, Chinese text recognition in all ways has extensive application markets. Based on our observations, we attribute the scarce attention on Chinese text recognition to the lack of reasonable dataset construction standards, unified evaluation methods, and results of the existing baselines. To fill this gap, we manually collect Chinese text datasets from publicly available competitions, projects, and papers, then divide them into four categories including scene, web, document, and handwriting datasets. Furthermore, we evaluate a series of representative text recognition methods on these datasets with unified evaluation methods to provide experimental results. By analyzing the experimental results, we surprisingly observe that state-of-the-art baselines for recognizing English texts cannot perform well on Chinese scenarios. We consider that there still remain numerous challenges under exploration due to the characteristics of Chinese texts, which are quite different from English texts. The code and datasets are made publicly available at https://github.com/ FudanVI/benchmarking-chinese-text-recognition. Figure 1. Three reasons for the scarce attention of Chinese text recognition. (a) People may use different ways to crop text regions, which leads to unfair comparison. (b) It is necessary to specify the equivalence between lowercase and uppercase, half-width and full-width, simplified and traditional characters. (c) The existing methods are mainly evaluated with English datasets rather than Chinese datasets.

show abstract

Section: Existing Text Recognition Methodsmentioning

confidence: 99%

Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Chen¹,

Yu²,

Ma³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…To utilize the benefits of a bi-directional Transformer, the non-autoregressive decoder has been introduced in STR community. The general decoding process of them [1,6,18,25,28] lies in the effective construction of a sequence processed parallelly in the decoder. Specifically, positional embeddings describing the order of the output sequence are used to align visual (or semantic) features.…”

Section: Related Workmentioning

confidence: 99%

“…They typically consist of a visual feature extractor, abstracting the image patch, and a character sequence generator, responsible for character decoding. Despite wide explorations to find better visual feature extractors and character sequence generators, existing methods still suffer from challenging environments: occlusion, (a) STR methods w/ an LM [6,28] (b) Semantic-MATRN and [3] (c) Visual-MATRN and [25] (d) MATRN blurs, distortions, and other artifacts [2,3].…”

Section: Introductionmentioning

confidence: 99%

“…At each stage, a character sequence, designed as differentiable with Gumbel-softmax, is re-fined by re-assessing visual clues. Concurrently, Wang et al [25] propose a language-aware visual mask that occludes selected character regions for enhancing the visual clues at the training phase. They prove that combining visual clues and semantic knowledge leads to better STR performances.…”

Section: Introductionmentioning

confidence: 99%

“…They prove that combining visual clues and semantic knowledge leads to better STR performances. Inspired by [3] and [25], we raise a novel question: what is the best way to model the interactions between visual and semantic features identified by VM and LM, respectively?…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Na¹,

Kim²,

Park³

2021

Preprint

View full text Add to dashboard Cite

Linguistic knowledge has brought great benefits to scene text recognition by providing semantics to refine character sequences. However, since linguistic knowledge has been applied individually on the output sequence, previous methods have not fully utilized the semantics to understand visual clues for text recognition. This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances. Specifically, MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features. Based on the spatial encoding, visual and semantic features are enhanced by referring to related features in the other modality. Furthermore, MATRN stimulates combining semantic features into visual features by hiding visual clues related to the character in the training phase. Our experiments demonstrate that MATRN achieves state-of-theart performances on seven benchmarks with large margins, while naive combinations of two modalities show marginal improvements. Further ablative studies prove the effectiveness of our proposed components. Our implementation will be publicly available.

show abstract

Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval

Li¹,

Xie²,

Ge³

et al. 2022

Lecture Notes in Computer Science

View full text Add to dashboard Cite

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

Cited by 102 publications

References 34 publications

Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval

Contact Info

Product

Resources

About