2021
DOI: 10.1109/tip.2020.3038520

FREE: A Fast and Robust End-to-End Video Text Spotter

Abstract: Currently, video text spotting tasks usually follow a four-stage pipeline: detecting text regions in individual frames, recognizing the localized text regions frame by frame, tracking text streams, and post-processing to generate final results. However, these methods may suffer from a huge computational cost as well as suboptimal results due to interference from low-quality text and the non-trainable pipeline strategy. In this paper, we propose a fast and robust end-to-end video text spotting framework named FREE …
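The conventional four-stage pipeline that the abstract contrasts FREE against can be sketched as follows. This is an illustrative outline only: all function names are hypothetical placeholders, not FREE's actual API or the cited baselines' code.

```python
# Hypothetical sketch of the conventional four-stage video text
# spotting pipeline: detect -> recognize -> track -> post-process.
# Every function here is an illustrative stub.

def detect_text_regions(frame):
    # Stage 1: detect text regions in an individual frame.
    # Placeholder: would return a list of (box, crop) pairs.
    return []

def recognize_region(crop):
    # Stage 2: recognize one localized text region frame by frame.
    return ""

def track_streams(per_frame_results):
    # Stage 3: link per-frame detections into text streams
    # (the same text instance followed across frames).
    return per_frame_results  # identity placeholder

def postprocess(streams):
    # Stage 4: e.g. vote for the best transcription per stream.
    return streams

def four_stage_spotter(frames):
    # Each stage depends fully on the previous one, so errors from
    # low-quality frames propagate and nothing is trained jointly --
    # the drawback the abstract points out.
    per_frame = []
    for frame in frames:
        regions = detect_text_regions(frame)
        per_frame.append([(box, recognize_region(crop))
                          for box, crop in regions])
    return postprocess(track_streams(per_frame))
```

The separation of stages is what makes the pipeline non-trainable end to end, motivating FREE's unified design.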

Cited by 21 publications (29 citation statements)
References 91 publications (147 reference statements)
“…- Annotations. The annotation strategy is the same as in LSVTD [4]. For each text region, the annotation items are as follows: (1) polygon coordinates represent the text location.…”
Section: Dataset
confidence: 99%
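An annotation record in the style described above might look like the following. The field names are assumptions for illustration; the actual schema is defined by the LSVTD [4] annotation format.

```python
# Illustrative annotation record: a polygon marks one text region
# in one frame. Field names are hypothetical, not the LSVTD schema.

annotation = {
    "frame_id": 17,
    "instance_id": 3,          # same text instance tracked across frames
    "polygon": [               # (x, y) vertices of the text region
        (120, 40), (260, 42), (258, 78), (118, 76),
    ],
    "transcription": "EXIT",
}

def polygon_bbox(polygon):
    # Axis-aligned bounding box of a polygon; a common convenience
    # for quick overlap checks during evaluation.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)
```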
“…However, in many real-world applications, sequence-level spotting results are what users most urgently need, while frame-wise recognition results are of far less concern to them. Therefore, we propose sequence-level evaluation protocols to evaluate end-to-end performance, i.e., Recall_s, Precision_s, and F-score_s as used in [4]. Here, a predicted text sequence is regarded as a true positive if and only if it satisfies two constraints:…”
Section: Task 3 - End-to-End Video Text Spotting
confidence: 99%
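The sequence-level scoring described above can be sketched as a standard one-to-one matching of predicted sequences to ground-truth sequences. This is a minimal illustration: the matching predicate is passed in as a parameter because the excerpt's two true-positive constraints (the spatial/temporal overlap and recognition criteria of [4]) are not reproduced here.

```python
# Minimal sketch of sequence-level Precision_s / Recall_s / F-score_s.
# A predicted sequence is a true positive only if it matches an
# as-yet-unmatched ground-truth sequence under `is_match`, which
# stands in for the two constraints of the cited protocol.

def sequence_level_scores(predictions, ground_truths, is_match):
    matched_gt = set()
    tp = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truths):
            if i not in matched_gt and is_match(pred, gt):
                matched_gt.add(i)   # each ground truth matches at most once
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

For example, with exact-transcription matching as a stand-in predicate, two predictions against two ground truths sharing one common sequence yield precision, recall, and F-score of 0.5 each.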
“…Yu et al. [21] try to learn the feature embedding in an online association manner. Cheng et al. [22] and [23] propose trackers for text instances using metric-learning methods. For these tracking-by-detection approaches, the association step fully depends on the first step, i.e., detecting text in the spatial dimension.…”
Section: Introduction
confidence: 99%