2022
DOI: 10.1007/978-3-031-19815-1_17

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Cited by 16 publications (5 citation statements)
References 62 publications
“…The comparison results are shown in Tab. 7, from which we can see that without pretext tasks for pretraining, DB+FastTCM-CR50 consistently outperforms previous methods including DB+STKM [7], DB+VLPT [6], and DB+oCLIP [8]. Especially on IC15, our method outperforms the previous state-of-the-art pretraining method by a large margin, with 89.5% versus 86.5% in terms of the F-measure.…”
Section: Comparison With Pretraining Methods
confidence: 71%
“…Taking inspiration from CLIP, Song et al [6] formulated three pretraining tasks for fine-grained cross-modality interaction, designed to align unimodal embeddings and learn enhanced representations of the backbone. Xue et al [8] proposed a weakly supervised pretraining method, which simultaneously learns and aligns visual and partial text instance information, with the aim of producing effective visual text representations.…”
Section: Cross-modal Pretraining Methods
confidence: 99%
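Both cited approaches build on CLIP-style alignment of visual and textual embeddings during pretraining. The sketch below illustrates the general contrastive alignment objective behind such methods; the encoder dimensions, module names, and temperature value are illustrative assumptions and do not reproduce the specific pretext tasks of VLPT [6] or oCLIP [8].

```python
# Minimal sketch of a CLIP-style image-text contrastive alignment loss.
# Dimensions and projection heads are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAlignment(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512, temperature=0.07):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, image_feats, text_feats):
        # image_feats: (B, image_dim) pooled backbone features
        # text_feats:  (B, text_dim) pooled text-encoder features
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)

        # Pairwise cosine similarities, scaled by a learned temperature.
        logits = self.logit_scale.exp() * img @ txt.t()  # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric InfoNCE: match each image to its text and vice versa.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)
```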
“…The DeepSE X Upstage HK team ranks 2nd on the leaderboard. They used DBNet [7] as the scene text detector and leveraged the oCLIP [6] pretrained Swin Transformer-Base [5] model as the backbone to make direct predictions at three different levels. Following DBNet, they employed balanced cross-entropy loss for the binary map and L1 loss for the threshold map.…”
Section: Task 1 Methodology
confidence: 99%
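The loss described in this citation follows the standard DBNet recipe: a balanced (hard-negative-mined) binary cross-entropy term for the text-region map and an L1 term for the threshold map. The sketch below assembles such a composite loss; the loss weights, the negative-mining ratio, and the function names are illustrative assumptions rather than the team's exact implementation.

```python
# Minimal sketch of a DBNet-style detection loss: balanced BCE for the
# probability/binary map plus L1 for the threshold map. Weights and the
# negative-mining ratio are illustrative assumptions.
import torch
import torch.nn.functional as F

def balanced_bce(pred, gt, neg_ratio=3.0, eps=1e-6):
    """BCE over all positive pixels and the hardest neg_ratio x #positives negatives."""
    pos_mask = gt > 0.5
    neg_mask = ~pos_mask
    loss = F.binary_cross_entropy(pred, gt, reduction="none")

    pos_loss = loss[pos_mask]
    n_pos = pos_mask.sum().clamp(min=1)
    n_neg = torch.minimum(neg_mask.sum(), (neg_ratio * n_pos).long())
    neg_loss, _ = loss[neg_mask].topk(int(n_neg))  # keep only the hardest negatives

    return (pos_loss.sum() + neg_loss.sum()) / (n_pos + n_neg + eps)

def db_loss(prob_map, thresh_map, gt_prob, gt_thresh, thresh_mask,
            alpha=1.0, beta=10.0):
    # Balanced BCE on the shrunk text-region probability map.
    l_prob = balanced_bce(prob_map, gt_prob)
    # L1 on the threshold map, computed only inside the dilated text border region.
    l_thresh = (torch.abs(thresh_map - gt_thresh) * thresh_mask).sum() \
               / thresh_mask.sum().clamp(min=1)
    return alpha * l_prob + beta * l_thresh
```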