2022
DOI: 10.1007/978-3-031-19815-1_17

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Cited by 16 publications (5 citation statements)
References 62 publications
“…The comparison results are shown in Tab. 7, from which we can see that without pretext tasks for pretraining, DB+FastTCM-CR50 consistently outperforms previous methods including DB+STKM [7], DB+VLPT [6], and DB+oCLIP [8]. Especially on IC15, our method outperforms the previous state-of-the-art pretraining method by a large margin, with 89.5% versus 86.5% in terms of the F-measure.…”
Section: Comparison With Pretraining Methods
confidence: 71%
“…Taking inspiration from CLIP, Song et al [6] formulated three pretraining tasks for fine-grained cross-modality interaction, designed to align unimodal embeddings and learn enhanced representations of the backbone. Xue et al [8] proposed a weakly supervised pretraining method, which simultaneously learns and aligns visual and partial text instance information, with the aim of producing effective visual text representations.…”
Section: Cross-modal Pretraining Methods
confidence: 99%
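Both cited approaches build on CLIP-style alignment of visual and textual embeddings during pretraining. The sketch below illustrates the general contrastive alignment objective behind such methods; the encoder dimensions, module names, and temperature value are illustrative assumptions and do not reproduce the specific pretext tasks of VLPT [6] or oCLIP [8].

```python
# Minimal sketch of a CLIP-style image-text contrastive alignment loss.
# Dimensions and projection heads are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAlignment(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512, temperature=0.07):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, image_feats, text_feats):
        # image_feats: (B, image_dim) pooled backbone features
        # text_feats:  (B, text_dim) pooled text-encoder features
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)

        # Pairwise cosine similarities, scaled by a learned temperature.
        logits = self.logit_scale.exp() * img @ txt.t()  # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric InfoNCE: match each image to its text and vice versa.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)
```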
“…The DeepSE X Upstage HK team ranks 2nd on the leaderboard. They used DBNet [7] as the scene text detector and leveraged the oCLIP [6] pretrained Swin Transformer-Base [5] model as the backbone to make direct predictions at three different levels. Following DBNet, they employed balanced cross-entropy loss for the binary map and L1 loss for the threshold map.…”
Section: Task 1 Methodology
confidence: 99%
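The loss described in this citation follows the standard DBNet recipe: a balanced (hard-negative-mined) binary cross-entropy term for the text-region map and an L1 term for the threshold map. The sketch below assembles such a composite loss; the loss weights, the negative-mining ratio, and the function names are illustrative assumptions rather than the team's exact implementation.

```python
# Minimal sketch of a DBNet-style detection loss: balanced BCE for the
# probability/binary map plus L1 for the threshold map. Weights and the
# negative-mining ratio are illustrative assumptions.
import torch
import torch.nn.functional as F

def balanced_bce(pred, gt, neg_ratio=3.0, eps=1e-6):
    """BCE over all positive pixels and the hardest neg_ratio x #positives negatives."""
    pos_mask = gt > 0.5
    neg_mask = ~pos_mask
    loss = F.binary_cross_entropy(pred, gt, reduction="none")

    pos_loss = loss[pos_mask]
    n_pos = pos_mask.sum().clamp(min=1)
    n_neg = torch.minimum(neg_mask.sum(), (neg_ratio * n_pos).long())
    neg_loss, _ = loss[neg_mask].topk(int(n_neg))  # keep only the hardest negatives

    return (pos_loss.sum() + neg_loss.sum()) / (n_pos + n_neg + eps)

def db_loss(prob_map, thresh_map, gt_prob, gt_thresh, thresh_mask,
            alpha=1.0, beta=10.0):
    # Balanced BCE on the shrunk text-region probability map.
    l_prob = balanced_bce(prob_map, gt_prob)
    # L1 on the threshold map, computed only inside the dilated text border region.
    l_thresh = (torch.abs(thresh_map - gt_thresh) * thresh_mask).sum() \
               / thresh_mask.sum().clamp(min=1)
    return alpha * l_prob + beta * l_thresh
```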