Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Chen, Jingye; Yu, Hao; Ma, Jing; Guan, Mengnan; Xu, Xiping; Wang, Xiaocong; Qu, Shaobo; Li, Bin; Xue, Xiangyang

doi:10.48550/arxiv.2112.15093

Cited by 5 publications

(15 citation statements)

References 80 publications


“…The Sequence-to-Sequence models (Zhang et al. 2020b; Wang et al. 2019; Sheng, Chen, and Xu 2019; Bleeker and de Rijke 2019; Lee et al. 2020; Atienza 2021; Chen et al. 2021) are gradually attracting more attention, especially after the advent of the Transformer architecture (Vaswani et al. 2017). SaHAN (Zhang et al. 2020b), standing for the scale-aware hierarchical attention network, are proposed to address the character scale-variation issue.…”

Section: Related Work, Scene Text Recognition (mentioning)

confidence: 99%

TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Chen et al. 2023. AAAI


Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.



“…Different from previous methods, we only utilize text labels instead of pixel-wise labels. The bi-lingual text-related benchmark [18] can also be used for training the recognition module. Implementation Details: The size of the input is set to 48×160 and the number of channels for the fused feature is set to 512.…”

Section: Datasets and Implementation Details, Datasets (mentioning)

confidence: 99%

Weakly-Supervised Text Instance Segmentation

Zu, Yu, Li et al. 2023. Preprint


Text segmentation is a challenging vision task with many downstream applications. Current text segmentation methods require pixel-level annotations, which are costly in human labor and limited in application scenarios. In this paper, we make the first attempt to perform weakly-supervised text instance segmentation by bridging text recognition and text segmentation. The insight is that text recognition methods provide a precise attention position for each text instance, and the attention location can be fed to both a text adaptive refinement head (TAR) and a text segmentation head. Specifically, the proposed TAR generates pseudo labels by performing two-stage iterative refinement operations on the attention location to fit the accurate boundaries of the corresponding text instance. Meanwhile, the text segmentation head takes the rough attention location to predict segmentation masks, which are supervised by the aforementioned pseudo labels. In addition, we design a mask-augmented contrastive learning by treating our segmentation result as an augmented version of the input text image, thus improving the visual representation and further enhancing the performance of both recognition and segmentation. The experimental results demonstrate that the proposed method significantly outperforms weakly-supervised instance segmentation methods on the ICDAR13-FST (18.95% improvement) and TextSeg (17.80% improvement) benchmarks.


“…Scene text recognition (STR) [33], [1], [2], [34], [35] has made great progress in recent years. Specifically, CRNN [36] takes CNN and RNN as the encoder and employs a CTC-based [37] decoder to maximize the probabilities of paths that can reach the ground truth.…”

Section: Scene Text Recognition (mentioning)

confidence: 99%

Positional Information is a Strong Supervision for Volumetric Medical Image Segmentation

Yu-hong, Hou, Zeng et al. 2023. J. Shanghai Jiaotong Univ. (Sci.)


Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g., shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision for STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN), that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch that aims to generate high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradations, and exploit the feedback from a recognizer and text-level annotations as a weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the quality of the HQ images, which is used to suppress erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performance.

