Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting

Qiao, Liang; Tang, Sanli; Cheng, Zhanzhan; Xu, Yunlu; Niu, Yi; Pu, Shiliang; Wu, Fei

doi:10.1609/aaai.v34i07.6864

Cited by 79 publications

(60 citation statements)

References 36 publications


“…A. Spatial-Temporal Video Text Detector 1) Text Detection in Single Frame: Since TP [17] is a more robust text spotter than EAST [16] especially on irregular text detection, and also can be trained end-to-end. We redesign and implement the video text detector inspired by the TP architecture (including a text detection module, a shape transform module and a recognition module), as shown in Figure 2.…”

Section: Methods (mentioning)

confidence: 99%

“…With the rapid development of artificial intelligence techniques [18], [19], [20], [21], great progress has been made in many isolated applications such as causal inference [22], named entities identification [23], question answering [24], scene text spotting [5], [6], [17] and video understanding [25], [26]. However, it is very important to build multiple knowledge representation [27] for understanding the real and complex world.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, in order to sufficiently exploit the complementarity between detection and recognition, many methods [45], [4], [5], [6], [46], [7], [17], [47], [37], [48], [49], [50] are proposed to spot text in an end-to-end manner, which utilize the recognition information to optimize the localization task.…”

Section: A. Text Reading In Single Images (mentioning)

confidence: 99%

“…In fact, it is a very challenging task to optimize video text spotter end-to-end when taking multiple functional modules (text detection, text tracking and text recognition) into consideration, especially compared to the traditional four-staged pipeline strategy. Therefore, in this paper we develop an end-to-end trainable video text spotter with only two trainable modules: the video text detector and the text recommender, similar to the end-to-end text spotting methods [6], [17], [45], [47], [48], [49] in single images.…”

Section: B. Text Reading In Videos (mentioning)

confidence: 99%

“…Declaration of major extensions compared to the conference version [8]: (1) We achieve the video text spotting in an end-to-end trainable manner instead of the two-staged form in its conference version. To achieve this, we replace EAST [16] with an end-to-end trainable text spotting framework Text Perceptron [17] (abbr. TP), in which the original recognition module in TP is replaced with our text recommender submodule.…”

Section: Introduction (mentioning)

confidence: 99%


FREE: A Fast and Robust End-to-End Video Text Spotter

Cheng, Lü, Zou, et al. 2021

IEEE Trans. on Image Process.

Self Cite


Currently, video text spotting tasks usually follow a four-stage pipeline: detecting text regions in individual images, recognizing the localized text regions frame by frame, tracking text streams, and post-processing to generate final results. However, such pipelines may suffer from huge computational cost as well as suboptimal results, owing to interference from low-quality text and the non-trainable pipeline strategy. In this paper, we propose a fast and robust end-to-end video text spotting framework named FREE, which recognizes each localized text stream only once instead of recognizing it frame by frame. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations among video frames. Then a novel text recommender is developed to select the highest-quality text from each text stream for recognition. Here, the recommender is implemented by assembling text tracking, quality scoring and recognition into a trainable module. It not only avoids interference from low-quality text but also dramatically speeds up video text spotting. FREE unites the detector and recommender into a single framework, which helps achieve global optimization. In addition, we collect a large-scale video text dataset to support the video text spotting community, containing 100 videos from 21 real-life scenarios. Extensive experiments on public benchmarks show that our method greatly speeds up the text spotting process and also achieves remarkable state-of-the-art performance.

