Scene Text Detection Using Attention with Depthwise Separable Convolutions

Hassan, Ehtesham; Lekshmi, V. L.

doi:10.3390/app12136425

Cited by 9 publications

(5 citation statements)

References 73 publications

(95 reference statements)

“…Attention mechanisms have become an indispensable tool for designing advanced deep-learning models across various tasks and domains. To preserve more text feature information during feature extraction, we propose an improved attention feature fusion module (DSAF) based on AFF [32] which uses depthwise separable convolution [33] and embeds it into the feature extraction network ResNet [34] to reduce the loss of feature information and increase the degree of attention to features of different scales. The structure of this module is illustrated in Figure 2.…”

Section: Methods (mentioning)

confidence: 99%

A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion

Li, Wang, Huang et al. 2024

Sensors

Scene text detection is an important research field in computer vision, playing a crucial role in various application scenarios. However, existing scene text detection methods often fail to achieve satisfactory results when faced with text instances of different sizes, shapes, and complex backgrounds. To address the challenge of detecting diverse texts in natural scenes, this paper proposes a multi-scale natural scene text detection method based on attention feature extraction and cascaded feature fusion. This method combines global and local attention through an improved attention feature fusion module (DSAF) to capture text features of different scales, enhancing the network’s perception of text regions and improving its feature extraction capabilities. Simultaneously, an improved cascaded feature fusion module (PFFM) is used to fully integrate the extracted feature maps, expanding the receptive field of features and enriching the expressive ability of the feature maps. Finally, to address the cascaded feature maps, a lightweight subspace attention module (SAM) is introduced to partition the concatenated feature maps into several sub-space feature maps, facilitating spatial information interaction among features of different scales. In this paper, comparative experiments are conducted on the ICDAR2015, Total-Text, and MSRA-TD500 datasets, and comparisons are made with some existing scene text detection methods. The results show that the proposed method achieves good performance in terms of accuracy, recall, and F-score, thus verifying its effectiveness and practicality.

“…The difficulty in this detection task was that the detection frame rate of the model needed to be higher than the video frame rate in order to achieve the effect of the real-time detection. Current designs for lightweight networks were mainly applied in the following areas: the first was the lightweight design of convolutional layers, such as deep separable convolution [68, 69, 70]. The second was the design of convolutional modules, e.g., the annealing module used in Squeeze Net to achieve light-weighting by reducing the network parameters [71, 72].…”

Section: Methods (mentioning)

confidence: 99%

Table Tennis Track Detection Based on Temporal Feature Multiplexing Network

Liu et al. 2023

Sensors

Recording the trajectory of table tennis balls in real-time enables the analysis of the opponent’s attacking characteristics and weaknesses. The current analysis of the ball paths mainly relied on human viewing, which lacked certain theoretical data support. In order to solve the problem of the lack of objective data analysis in the research of table tennis competition, a target detection algorithm-based table tennis trajectory extraction network was proposed to record the trajectory of the table tennis movement in video. The network improved the feature reuse rate in order to achieve a lightweight network and enhance the detection accuracy. The core of the network was the “feature store & return” module, which could store the output of the current network layer and pass the features to the input of the network layer at the next moment to achieve efficient reuse of the features. In this module, the Transformer model was used to secondarily process the features, build the global association information, and enhance the feature richness of the feature map. According to the designed experiments, the detection accuracy of the network was 96.8% for table tennis and 89.1% for target localization. Moreover, the parameter size of the model was only 7.68 MB, and the detection frame rate could reach 634.19 FPS using the hardware for the tests. In summary, the network designed in this paper has the characteristics of both lightweight and high precision in table tennis detection, and the performance of the proposed model significantly outperforms that of the existing models.

“…After the CNN were used to extract features, the performance of the STD model began to depend on the design of special components, like Region Proposal Network (RPN), Feature Pyramid Network (FPN) [13,14], anchors, and other factors [15,16]. These algorithms required a lot of prior knowledge and complex post-processing steps.…”

Section: Related Work (mentioning)

confidence: 99%

“…Nowadays, more and more research about images introduces the Transformer and abandons traditional CNN [16]. Vision Transformer (ViT) [26] improved Transformer to classify images.…”

Section: Related Work (mentioning)

confidence: 99%

CA-STD: Scene Text Detection in Arbitrary Shape Based on Conditional Attention

Song et al. 2022

Information

Scene Text Detection (STD) is critical for obtaining textual information from natural scenes, serving for automated driving and security surveillance. However, existing text detection methods fall short when dealing with the variation in text curvatures, orientations, and aspect ratios in complex backgrounds. To meet the challenge, we propose a method called CA-STD to detect arbitrarily shaped text against a complicated background. Firstly, a Feature Refinement Module (FRM) is proposed to enhance feature representation. Additionally, the conditional attention mechanism is proposed not only to decouple the spatial and textual information from scene text images, but also to model the relationship among different feature vectors. Finally, the Contour Information Aggregation (CIA) is presented to enrich the feature representation of text contours by considering circular topology and semantic information simultaneously to obtain the detection curves with arbitrary shapes. The proposed CA-STD method is evaluated on different datasets with extensive experiments. On the one hand, the CA-STD outperforms state-of-the-art methods and achieves 82.9 in precision on the dataset of TotalText. On the other hand, the method has better performance than state-of-the-art methods and achieves the F1 score of 83.8 on the dataset of CTW-1500. The quantitative and qualitative analysis proves that the CA-STD can detect variably shaped scene text effectively.
