2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.01393
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

Cited by 102 publications (53 citation statements)
References 34 publications
“…Also, some works [21,58,82] attempt to construct text recognizers based on the Transformer [64], which has thrived in the field of natural language processing, to robustly learn representations for text images through self-attention modules. Recently, some researchers have incorporated semantic knowledge into text recognizers to fully exploit external language priors [20,54,75,88]. For example, SEED [54] uses text embeddings guided by fastText [8] to initialize the attention-based decoder.…”
Section: Existing Text Recognition Methods
confidence: 99%
“…To exploit the benefits of a bi-directional Transformer, non-autoregressive decoders have been introduced in the STR community. Their general decoding process [1,6,18,25,28] relies on effectively constructing a sequence that the decoder processes in parallel. Specifically, positional embeddings describing the order of the output sequence are used to align visual (or semantic) features.…”
Section: Related Work
confidence: 99%
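The parallel decoding idea described above can be sketched compactly: learned positional embeddings act as queries that attend over the visual feature map, and every character position is then classified at once rather than step by step. The sketch below is a minimal numpy illustration under assumed shapes; `parallel_decode`, the random weights, and the dimensions are all hypothetical stand-ins, not the API of any cited method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_decode(visual_feats, pos_emb, W_cls):
    """Non-autoregressive decoding sketch.

    visual_feats: (N, D) features from the image encoder
    pos_emb:      (T, D) positional embeddings, one per output slot
    W_cls:        (D, C) character classifier weights
    """
    # each positional query attends over all visual features in parallel
    attn = softmax(pos_emb @ visual_feats.T)   # (T, N) attention weights
    aligned = attn @ visual_feats              # (T, D) position-aligned features
    logits = aligned @ W_cls                   # (T, C) per-position character logits
    # all T characters are predicted simultaneously (no left-to-right loop)
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
T, N, D, C = 5, 8, 16, 37  # hypothetical: max length, feature slots, dim, charset size
chars = parallel_decode(rng.normal(size=(N, D)),
                        rng.normal(size=(T, D)),
                        rng.normal(size=(D, C)))
print(chars.shape)  # one prediction per output position: (5,)
```

The key contrast with an autoregressive decoder is the absence of any dependence of position *t* on the prediction at position *t-1*; the positional embeddings alone carry the ordering.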
“…They typically consist of a visual feature extractor, which abstracts the image patch, and a character sequence generator, which is responsible for character decoding. Despite wide exploration of better visual feature extractors and character sequence generators, existing methods still suffer in challenging environments: occlusion, blur, distortion, and other artifacts [2,3].…”
[Figure panels: (a) STR methods with an LM [6,28]; (b) Semantic-MATRN and [3]; (c) Visual-MATRN and [25]; (d) MATRN]
Section: Introduction
confidence: 99%