2022
DOI: 10.48550/arxiv.2205.00159
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

SVTR: Scene Text Recognition with a Single Visual Model

Abstract: Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character comp… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

0
10
0

Year Published

2022
2022
2024
2024

Publication Types

Select...
5
2

Relationship

0
7

Authors

Journals

citations
Cited by 16 publications
(19 citation statements)
references
References 2 publications
0
10
0
Order By: Relevance
“…Liu et al [48] perform text recognition using feature pyramids. We demonstrate that, without any task-specific engineering, we reconstruct fine details to perform robustly in dark, noisy conditions on SOTA text recognition methods [20,19,26,4,58], such as PARSeq [7].…”
Section: Related Workmentioning
confidence: 99%
“…Liu et al [48] perform text recognition using feature pyramids. We demonstrate that, without any task-specific engineering, we reconstruct fine details to perform robustly in dark, noisy conditions on SOTA text recognition methods [20,19,26,4,58], such as PARSeq [7].…”
Section: Related Workmentioning
confidence: 99%
“…AutoSTR [44] searches backbone via neural architecture search (NAS) [45]. More recently, semantic-aware [46], [43], transformer-based [47], linguistics-aware [48], [49], and efficient [50], [51] approaches are proposed to further boost the performance. Although these methods are able to handle irregular, occluded, and incomplete text images, they still have difficulty in recognizing low-resolution images.…”
Section: Scene Text Recognitionmentioning
confidence: 99%
“…The main body of SVTR-LCNet is optimized by SVTR-T 12 , with the first half consisting of the first three stages of PP-LCNet, and the second half consisting of a convolutional pooling layer, two global mixing modules, and a fully connected layer, as shown in Figure 4. Firstly, adjust the cropped Morse code time-frequency map I 1 to 320×16×3, and then input it into the first three stages of PP-LCNet, as shown in Figure 5.…”
Section: Recognition Model Based On Stvr-lcnetmentioning
confidence: 99%