A New DCT-FFT Fusion Based Method for Caption and Scene Text Classification in Action Video Images

Nandanwar, Lokesh; Shivakumara, Palaiahnakote; Manna, Suvojit; Pal, Umapada; Lü, Tong; Blumenstein, Michael

doi:10.1007/978-3-030-59830-3_7

Cited by 2 publications

(5 citation statements)

References 16 publications

(45 reference statements)

“…However, the methods are not tested on action images without text information. Recently, the method [42] proposes the combination of Discrete Cosine Transform and Fast Fourier Transform for classifying caption and scene texts in action images to improve text recognition results. The method generates a fused image for the input and then the average of sparsity and non-sparsity counts in terms pixel values of zero or non-zeros is computed for classification.…”

Section: Related Work (mentioning)

confidence: 99%
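
The citation statement above describes counting zero and non-zero pixel values in a fused image. The following is a minimal sketch of that counting step only. The function names, the per-row averaging, and the toy `fused` array are assumptions made for illustration; the statement does not specify how the average is taken or how the counts are turned into a class decision, so no decision rule is shown.

```python
def sparsity_counts(image):
    """Return (zero_count, nonzero_count) for a 2D list of pixel values."""
    zeros = sum(1 for row in image for value in row if value == 0)
    total = sum(len(row) for row in image)
    return zeros, total - zeros


def mean_counts_per_row(image):
    """Average zero and non-zero counts per row (one possible reading of 'average')."""
    if not image:
        return 0.0, 0.0
    zeros, nonzeros = sparsity_counts(image)
    return zeros / len(image), nonzeros / len(image)


# Hypothetical fused image, used only to exercise the functions.
fused = [
    [0, 0, 3, 0],
    [0, 5, 0, 0],
    [7, 0, 0, 2],
]
zero_count, nonzero_count = sparsity_counts(fused)
mean_zero, mean_nonzero = mean_counts_per_row(fused)
```

A classifier would then compare these two averages in some way; that comparison is left out here because the quoted text does not describe it.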

“…6 Sample images of successful classification of the proposed model on our dataset. Original Source: [42] https://doi.org/10.1007/s42452-021-04821-z of classes increases, the complexity of the problem also increases. But if we consider the overall performance in terms of classification rate, the proposed method outperforms the others.…”

Section: Concert (mentioning)

confidence: 99%

“…13 Samples of unsuccessful classification results of the proposed approach on different datasets. Original Source: [42] and [8]…”

Section: Craft (mentioning)

confidence: 99%

“…The images used in the proposed work are originally taken from [8,42]. The authors of this paper and the authors of [8,42] have given consent to use the data for publication.…”

Section: Declarations (mentioning)

confidence: 99%

“…Qualitative results of the text detection [9] on STD dataset before and after classification. Original Source: [8,42]…”

mentioning

confidence: 99%

A deep action-oriented video image classification system for text detection and recognition

et al. 2021

Self Cite

For video images with complex actions, achieving accurate text detection and recognition results is very challenging. This paper presents a hybrid model for the classification of action-oriented video images, which reduces the complexity of the problem to improve text detection and recognition performance. Here, we consider the following five genre categories: concert, cooking, craft, teleshopping and yoga. For classifying action-oriented video images, we explore ResNet50 for learning general pixel-distribution-level information, one VGG16 network for learning the features of Maximally Stable Extremal Regions, and a second VGG16 network for learning facial components obtained by a multitask cascaded convolutional network. The approach integrates the outputs of these three models using a fully connected neural network to classify the five action-oriented image classes. We demonstrate the efficacy of the proposed method by testing on our dataset and two other standard datasets, namely the Scene Text Dataset, which contains 10 classes of scene images with text information, and the Stanford 40 Actions dataset, which contains 40 action classes without text information. Our method outperforms the related existing work and significantly enhances the class-specific performance of text detection and recognition.

Article highlights: The method uses pixel, stable-region and face-component information in a novel way for solving complex classification problems. The proposed work fuses different deep learning models for successful classification of action-oriented images. Experiments on our own dataset as well as standard datasets show that the proposed model outperforms related state-of-the-art (SOTA) methods.
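
The abstract describes combining the outputs of three networks through a fully connected layer. The sketch below illustrates only that late-fusion step, using NumPy. The feature sizes, the random stand-in features and weights, and the names `fuse_and_classify`, `CLASSES`, `W` and `b` are all assumptions for illustration; it does not reproduce the authors' ResNet50 or VGG16 branches or their trained parameters.

```python
import numpy as np

# The five genre classes named in the abstract.
CLASSES = ["concert", "cooking", "craft", "teleshopping", "yoga"]


def softmax(z):
    """Numerically stable softmax over a 1D array."""
    shifted = z - z.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def fuse_and_classify(f_pixel, f_mser, f_face, weights, bias):
    """Concatenate three branch outputs and apply one fully connected layer."""
    fused = np.concatenate([f_pixel, f_mser, f_face])
    return softmax(weights @ fused + bias)


# Random stand-ins for the three branch outputs (sizes are assumed).
rng = np.random.default_rng(0)
f_pixel = rng.normal(size=8)   # stand-in for the pixel-distribution branch
f_mser = rng.normal(size=4)    # stand-in for the MSER branch
f_face = rng.normal(size=4)    # stand-in for the facial-component branch

W = rng.normal(size=(len(CLASSES), 16))  # untrained, illustrative weights
b = np.zeros(len(CLASSES))

probs = fuse_and_classify(f_pixel, f_mser, f_face, W, b)
predicted = CLASSES[int(np.argmax(probs))]
```

With trained weights, `probs` would hold the class probabilities for one image; here the values are meaningless because the weights are random.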

A New Hybrid Method for Caption and Scene Text Classification in Action Video Images

Nandanwar

Shivakumara

Pal

et al. 2021

Int. J. Patt. Recogn. Artif. Intell.

Achieving a better recognition rate for text in action video images is challenging due to multiple types of text with unpredictable actions in the background. In this paper, we propose a new method for the classification of caption (which is edited text) and scene text (text that is a part of the video) in video images. This work considers five action classes, namely, Yoga, Concert, Teleshopping, Craft, and Recipes, where it is expected that both types of text play a vital role in understanding the video content. The proposed method introduces a new fusion criterion based on Discrete Cosine Transform (DCT) and Fourier coefficients to obtain the reconstructed images for caption and scene text. The fusion criterion involves computing the variances for coefficients of corresponding pixels of DCT and Fourier images, and the same variances are considered as the respective weights. This step results in Reconstructed image-1. Inspired by the special property of Chebyshev-Harmonic-Fourier-Moments (CHFM) that they can reconstruct a redundancy-free image, we explore CHFM for obtaining Reconstructed image-2. The reconstructed images, along with the input image, are passed to a Deep Convolutional Neural Network (DCNN) for classification of caption/scene text. Experimental results on five action classes and a comparative study with the existing methods demonstrate that the proposed method is effective. In addition, recognition results obtained from different methods before and after classification show that recognition performance improves significantly after classification.
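
The variance-weighted fusion of DCT and Fourier coefficients described in the abstract can be sketched roughly as follows. This is an illustration under several assumptions that the abstract does not confirm: variances are taken globally over each coefficient map rather than per pixel neighbourhood, the Fourier image is represented by its magnitude, the weights are normalised to sum to one, and the fused coefficients are mapped back with an inverse DCT. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def dct2(img):
    """2D DCT-II, applied along rows and columns."""
    rows, cols = img.shape
    return dct_matrix(rows) @ img @ dct_matrix(cols).T


def idct2(coeffs):
    """Inverse of dct2 (the basis matrices are orthonormal)."""
    rows, cols = coeffs.shape
    return dct_matrix(rows).T @ coeffs @ dct_matrix(cols)


def variance_weighted_fusion(img):
    """Fuse DCT coefficients and FFT magnitudes using their variances as weights."""
    d = dct2(img)
    f = np.abs(np.fft.fft2(img))      # assumption: use the Fourier magnitude
    var_d, var_f = d.var(), f.var()   # assumption: global variances
    total = var_d + var_f
    if total == 0:
        w_d = w_f = 0.5               # degenerate case, e.g. an all-zero image
    else:
        w_d, w_f = var_d / total, var_f / total
    fused = w_d * d + w_f * f
    return idct2(fused), (w_d, w_f)   # assumption: reconstruct via inverse DCT


# Random stand-in for a grayscale text-line image.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
reconstructed, weights = variance_weighted_fusion(image)
```

The result plays the role of Reconstructed image-1 only in a loose sense; the CHFM-based Reconstructed image-2 and the DCNN classifier are not sketched here.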
