Transformer-based Text Detection in the Wild

Raisi, Zobeir; Naiel, Mohamed A.; Younes, Georges; Wardell, Steven; Zelek, John

doi:10.1109/cvprw53098.2021.00353

Cited by 39 publications

(17 citation statements)

References 25 publications

“…Early regression-based methods such as TextBoxes++ [5] and EAST [6] used SSD's [23] architecture to detect text regions with rotated rectangles or quadrilateral descriptions. More recently, [31] extended DETR's [29] architecture to output rotated rectangular boxes directly and achieved SOTA performance on multi-oriented benchmark datasets. However, these representations ignore the geometric traits of arbitrarily shaped curved text and end up including considerable background noise.…”

Section: B Regression-based Methods (mentioning)

confidence: 99%

“…To achieve this, we modify the prediction head of deformable DETR's architecture [32] to output 16 parameters that represent the Bezier control points. However, unlike [29] and [32], which use a generic Generalized Intersection over Union (GIoU) with $\ell_1$-regression [39] (shown in Figure 1(a)), we propose a split GIoU loss for Bezier control points of (3) (shown in Figure 2), along with a Smooth-ln regression-based loss [31].…”

Section: B Proposed System (mentioning)

confidence: 99%
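
The generic GIoU term that this statement contrasts against can be illustrated with a minimal sketch for axis-aligned boxes. This is the standard GIoU definition, not the split GIoU loss the citing paper proposes; the function names `giou` and `giou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area (x2 > x1 and y2 > y1).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width/height clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull_w = max(ax2, bx2) - min(ax1, bx1)
    hull_h = max(ay2, by2) - min(ay1, by1)
    hull = hull_w * hull_h

    iou = inter / union
    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    return iou - (hull - union) / hull


def giou_loss(box_a, box_b):
    """Loss form commonly used in detectors: 1 - GIoU, which lies in [0, 2]."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, the value stays informative for disjoint boxes: it becomes more negative as the boxes move apart, so the loss still provides a training signal.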

“…where $\lambda_1, \lambda_2 \in \mathbb{R}$ are hyper-parameters, and $L^{B}_{\mathrm{reg}}(\cdot)$ and $L^{B}_{\mathrm{GIoU}}(\cdot)$ are the Bezier-curve loss functions based on regression and GIoU. For regression, we use the Smooth-ln based Regression Loss as in [31]. The regression loss is then defined as:…”

Section: B Proposed System (mentioning)

confidence: 99%

“…As for ICDAR15, we further pre-train the models using about 10,000 images of the ICDAR17 [43] dataset for 50 epochs and then fine-tune for about 300 epochs to ensure the training converges. For calculating the rotated version of the bounding box loss function, we used the method described in [31].…”

Section: A Implementation Details (mentioning)

confidence: 99%

“…Recent advancements in object detection enabled Transformer frameworks [26][27][28] like DETR (Detection Transformer) [29] to eliminate the need for many of the existing handcrafted post-processing steps, such as anchor generation and non-maximum suppression (NMS), from the object detection pipeline [21,23,24,30], all while achieving superior performance. For example, Raisi et al. [31] leveraged the DETR [29] architecture for multi-oriented scene text detection and achieved SOTA performance on some benchmark datasets. Nevertheless, DETR has difficulties detecting small objects and suffers from a slow convergence rate.…”

Section: Introduction (mentioning)

confidence: 99%

Arbitrary Shape Text Detection using Transformers

Raisi, Younes, Zelek

2022

Preprint

Self Cite

Recent text detection frameworks require several handcrafted components such as anchor generation, non-maximum suppression (NMS), or multiple processing stages (e.g. label generation) to detect arbitrarily shaped text images. In contrast, we propose an end-to-end trainable architecture based on Detection using Transformers (DETR) that outperforms previous state-of-the-art methods in arbitrary-shaped text detection. At its core, our proposed method leverages a bounding box loss function that accurately measures the arbitrary detected text regions' changes in scale and aspect ratio. This is possible due to a hybrid shape representation made from Bezier curves that are further split into piece-wise polygons. The proposed loss function is then a combination of a generalized-split-intersection-over-union loss defined over the piece-wise polygons, regularized by a Smooth-ln regression over the Bezier curve's control points. We evaluate our proposed model using the Total-Text and CTW-1500 datasets for curved text, and the MSRA-TD500 and ICDAR15 datasets for multi-oriented text, and show that the proposed method outperforms the previous state-of-the-art methods in arbitrary-shape text detection tasks.
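
The hybrid representation described in the abstract, Bezier curves split into piece-wise polygons, can be illustrated with a rough sketch. It assumes, without confirmation from the abstract, that the 16 predicted parameters encode two cubic Bezier curves of four control points each (top edge left to right, then bottom edge right to left). The helper names `cubic_bezier`, `split_into_quads` and `polygon_area` are hypothetical, and the sampling density is arbitrary.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1] using the Bernstein basis."""
    s = 1.0 - t
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def split_into_quads(params, samples=8):
    """Split a Bezier-bounded text region into (samples - 1) quadrilaterals.

    params: 16 floats, assumed layout (hypothetical): top-curve control points
    left to right, then bottom-curve control points right to left.
    """
    pts = [(params[i], params[i + 1]) for i in range(0, 16, 2)]
    top_ctrl, bottom_ctrl = pts[:4], pts[4:]
    ts = [i / (samples - 1) for i in range(samples)]
    top = [cubic_bezier(*top_ctrl, t) for t in ts]
    bottom = [cubic_bezier(*bottom_ctrl, t) for t in ts]

    quads = []
    n = samples
    for k in range(n - 1):
        # The bottom curve runs right to left, so index it in reverse to pair
        # each top segment with the bottom segment directly beneath it.
        quads.append([top[k], top[k + 1], bottom[n - 2 - k], bottom[n - 1 - k]])
    return quads


def polygon_area(poly):
    """Absolute polygon area via the shoelace formula."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

A per-piece overlap measure, such as an IoU variant, could then be computed on each quadrilateral and aggregated; how the paper defines and combines those terms is not stated in the abstract, so that step is left out here.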