SVTR: Scene Text Recognition with a Single Visual Model

Du, Yongkun; Chen, Zhineng; Jia, Caiyan; Yin, Xiaoting; Zheng, Taiying; Li, Chenxia; Du, Yuning; Jiang, Yu‐Gang

doi:10.48550/arxiv.2205.00159

Cited by 16 publications

(19 citation statements)

References 2 publications


“…Liu et al [48] perform text recognition using feature pyramids. We demonstrate that, without any task-specific engineering, we reconstruct fine details to perform robustly in dark, noisy conditions on SOTA text recognition methods [20,19,26,4,58], such as PARSeq [7].…”

Section: Related Work (mentioning)

confidence: 99%

Diffusion in the Dark: A Diffusion Model for Low-Light Text Recognition

Nguyen, Chan, Bergman et al. 2023

Preprint


Images are indispensable for the automation of high-level tasks, such as text recognition. Low-light conditions pose a challenge for these high-level perception stacks, which are often optimized on well-lit, artifact-free images. Reconstruction methods for low-light images can produce well-lit counterparts, but typically at the cost of high-frequency details critical for downstream tasks. We propose Diffusion in the Dark (DiD), a diffusion model for low-light image reconstruction that provides qualitatively competitive reconstructions with that of SOTA, while preserving high-frequency details even in extremely noisy, dark conditions. We demonstrate that DiD, without any task-specific optimization, can outperform SOTA low-light methods in low-light text recognition on real images, bolstering the potential of diffusion models for ill-posed inverse problems.


“…AutoSTR [44] searches backbone via neural architecture search (NAS) [45]. More recently, semantic-aware [46], [43], transformer-based [47], linguistics-aware [48], [49], and efficient [50], [51] approaches are proposed to further boost the performance. Although these methods are able to handle irregular, occluded, and incomplete text images, they still have difficulty in recognizing low-resolution images.…”

Section: Scene Text Recognition (mentioning)

confidence: 99%

Positional Information is a Strong Supervision for Volumetric Medical Image Segmentation

Yu-hong, Hou, Zeng et al. 2023

J. Shanghai Jiaotong Univ. (Sci.)


Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN) that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradation, and exploit the feedback from a recognizer and text-level annotations as weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress the erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.


“…The main body of SVTR-LCNet is optimized by SVTR-T [12], with the first half consisting of the first three stages of PP-LCNet, and the second half consisting of a convolutional pooling layer, two global mixing modules, and a fully connected layer, as shown in Figure 4. Firstly, adjust the cropped Morse code time-frequency map I_1 to 320×16×3, and then input it into the first three stages of PP-LCNet, as shown in Figure 5.…”

Section: Recognition Model Based On Stvr-lcnet (mentioning)

confidence: 99%

Morse code detection and recognition algorithm based on YOLO-SVTR

Li, Wei, Han 2023

Fourth International Conference on Signal Processing and Computer Science (SPCS 2023)

