2021
DOI: 10.48550/arxiv.2106.12326
Preprint
Open Images V5 Text Annotation and Yet Another Mask Text Spotter

Abstract: A large-scale human-labeled dataset plays an important role in creating high-quality deep learning models. In this paper we present a text annotation for the Open Images V5 dataset. To our knowledge, it is the largest among publicly available manually created text annotations. Using this annotation, we trained a simple Mask-RCNN-based network, referred to as Yet Another Mask Text Spotter (YAMTS), which achieves competitive performance or even outperforms current state-of-the-art approaches in some cases on ICDAR 2013, I…
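For readers unfamiliar with the starting point the abstract names, the sketch below shows how a stock Mask R-CNN can be specialized to a single "text" class with torchvision. It is only an illustration of the Mask-RCNN family that YAMTS builds on, not the authors' implementation; the function name and class count are assumptions, and it assumes torchvision >= 0.13.

```python
# Minimal sketch: a Mask R-CNN detector configured for one "text" class.
# NOT the YAMTS implementation, only the Mask-RCNN starting point it extends.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_text_detector(num_classes: int = 2):  # background + "text"
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box head so it predicts only background vs. text.
    in_feat = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
    # Replace the mask head likewise; masks give word-level regions.
    in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)
    return model

model = build_text_detector()
model.eval()
with torch.no_grad():
    out = model([torch.rand(3, 800, 800)])  # one dummy image
print(out[0]["boxes"].shape, out[0]["masks"].shape)
```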

Cited by 2 publications (3 citation statements) | References 16 publications
“…Dai et al. [79] introduced the Region-based Fully Convolutional Network (R-FCN), which uses convolutional layers that share almost all computation across the entire image in place of per-region fully connected layers. They proposed position-sensitive score maps to address the translation-invariance problem, which includes:…”
Section: R-FCN
Citation type: mentioning (confidence: 99%)
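To make the position-sensitive score map idea concrete, here is a minimal sketch built on torchvision's ps_roi_pool op: with a k x k grid, each spatial bin of an RoI reads its score from a dedicated channel group, which restores location sensitivity to an otherwise translation-invariant FCN. The grid size, class count, and tensor shapes are illustrative assumptions, not values from [79].

```python
# Hedged sketch of position-sensitive RoI pooling (the R-FCN core idea),
# using the real torchvision.ops.ps_roi_pool operator.
import torch
from torchvision.ops import ps_roi_pool

k = 3                      # pooling grid (k x k bins per RoI)
num_classes = 2            # e.g. background + text
# Score maps: k*k channel groups per class, as in R-FCN.
score_maps = torch.rand(1, num_classes * k * k, 50, 50)

# One RoI: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 10.0, 10.0, 40.0, 40.0]])

# spatial_scale maps RoI coords to feature-map coords (1.0 in this toy case).
pooled = ps_roi_pool(score_maps, rois, output_size=k, spatial_scale=1.0)
print(pooled.shape)        # torch.Size([1, num_classes, k, k])
```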
“…It was replaced with RoIAlign [11], which uses bilinear interpolation for weighted feature sampling and was later extended, for the first time, to sample non-axis-aligned (i.e., rotated) RoIs [27]. For sampling arbitrarily shaped text, further extensions [33, 16, 20] added a background mask to the sampling operation to isolate the extracted word, often relying on segmentation-based detectors or masks.…”
Section: Background and Related Work
Citation type: mentioning (confidence: 99%)
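As a concrete reference for the sampling operation this quote describes, the sketch below uses torchvision's roi_align (bilinear, sub-pixel sampling) and then applies a background mask in the spirit of the masked extensions. The shapes, the spatial scale, and the mask itself are illustrative assumptions.

```python
# Hedged sketch of RoIAlign feature sampling via torchvision.ops.roi_align.
# Bilinear interpolation lets gradients flow through sub-pixel locations,
# which is why E2E spotters sample features instead of cropping pixels.
import torch
from torchvision.ops import roi_align

features = torch.rand(1, 256, 64, 64)          # FPN-style feature map
# RoIs as (batch_index, x1, y1, x2, y2) in image coordinates.
rois = torch.tensor([[0, 32.0, 48.0, 160.0, 80.0]])

# spatial_scale = feature resolution / image resolution (assumed 1/4 here).
word_feats = roi_align(features, rois, output_size=(8, 32),
                       spatial_scale=0.25, sampling_ratio=2, aligned=True)
print(word_feats.shape)                         # torch.Size([1, 256, 8, 32])

# The masked extensions multiply the sampled window by a word mask so
# background clutter is zeroed out (mask shown as a placeholder here):
mask = torch.ones(1, 1, 8, 32)                  # would come from a mask head
word_feats = word_feats * mask
```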
“…The components in this approach are mostly explored independently in the literature, isolating either the word detection performance (ignoring transcripts) [47,4,2,21,39], or the recognition performance over datasets composed of word-crop images [1,41,25,31]. The second approach is a combined End-to-End (E2E) architecture, adding a recognition branch that operates directly on the detection model's latent features [8,3,16,36,33,20,29]. Feature sampling replaces cropping, allowing detection and recognition to be jointly trained E2E.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
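The sketch below illustrates the E2E pattern this quote summarizes: a recognition branch operating on RoI-sampled latent features rather than image crops, so detection and recognition can be trained jointly. Every module name and shape here is a hypothetical stand-in, not the architecture of any cited paper.

```python
# Hedged, self-contained sketch of an E2E recognition branch that consumes
# RoI-sampled detector features. Purely illustrative module and shapes.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TinyRecognitionBranch(nn.Module):
    """CTC-style recognizer over a sampled (C, 8, 32) word feature window."""
    def __init__(self, in_channels=256, vocab_size=37):  # 36 chars + blank
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        self.classify = nn.Linear(64 * 8, vocab_size)    # H=8 after sampling

    def forward(self, word_feats):                        # (N, C, 8, 32)
        x = torch.relu(self.reduce(word_feats))           # (N, 64, 8, 32)
        x = x.permute(0, 3, 1, 2).flatten(2)              # (N, 32, 64*8)
        return self.classify(x)                           # (N, 32, vocab)

features = torch.rand(2, 256, 64, 64)                     # detector features
rois = torch.tensor([[0, 8.0, 8.0, 120.0, 40.0],
                     [1, 16.0, 32.0, 200.0, 64.0]])
word_feats = roi_align(features, rois, output_size=(8, 32),
                       spatial_scale=0.25, aligned=True)
logits = TinyRecognitionBranch()(word_feats)
print(logits.shape)                                       # (2, 32, 37)
```

Because the sampling is differentiable, a recognition loss on these logits backpropagates into the shared features, which is the "jointly trained E2E" property the quote refers to.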