Script identification in natural scene image and video frames using an attention based Convolutional-LSTM network

Bhunia, Ankan Kumar; Konwer, Aishik; Bhunia, Ayan Kumar; Bhowmick, Abir; Roy, Partha Pratim; Pal, Umapada

doi:10.1016/j.patcog.2018.07.034

Cited by 118 publications

(45 citation statements)

References 25 publications

“… In human visual system, attention is one of the important mechanisms in capturing information from images. Attention mechanism operates in such a way that it not only extracts the essential information from image, but also stores its contextual relation with other components of image [243]. In future, research may be carried out in the direction that preserves the spatial relevance of objects along with their discriminating features at later stages of learning.…”

Section: Future Directions (mentioning)

confidence: 99%

A survey of the recent architectures of deep convolutional neural networks

et al. 2020

Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown exemplary performance on several competitions related to Computer Vision and Image Processing. Interesting application areas of CNN include Image Classification and Segmentation, Object Detection, Video Processing, Natural Language Processing, Speech Recognition, etc. The powerful learning ability of deep CNN is largely due to the use of multiple feature extraction stages that can automatically learn representations from the data. Availability of a large amount of data and improvements in the hardware technology have accelerated the research in CNNs, and recently very interesting deep CNN architectures have been reported. In fact, several interesting ideas to bring advancements in CNNs have been explored such as the use of different activation and loss functions, parameter optimization, regularization, and architectural innovations. However, the major improvement in representational capacity of the deep CNN is achieved through architectural innovations. Especially, the idea of exploiting spatial and channel information, depth and width of architecture, and multi-path information processing has gained substantial attention. Similarly, the idea of using a block of layers as a structural unit is also gaining popularity. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention. Additionally, the elementary understanding of CNN components, current challenges and applications of CNN are also provided. 
CNNs are the best among learning algorithms in understanding images content, and have shown exemplary results in segmentation, classification, detection, and retrieval related tasks [8], [9]. The success of CNNs has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have developed active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners of image processing and computer vision competitions are employing deep CNN based models. The attractive feature of CNN is its ability to exploit spatial or time correlation of the data. The topology of CNN is divided into multiple learning stages composed of a combination of the convolutional layers, non-linear processing units, and subsampling layers [11]. CNNs are feedforward multilayered hierarchical networks that are similar to fully connected neural network where each layer, using a bank of convolutional kernels, performs multiple transformations [12]. Convolution operation extracts useful features from locally correlated data points. Output of the convolutional kernels is assigned to non-linear processing unit (activation function), which not only helps in learning abstractions but also emb...

“…The image in Figure 4(a) which has much more Chinese patches than kana was misclassified to Chinese by their model. Bhunia [20] coupled local and global features, but it suffers from the impairment too. What is more, the use of many cropped patches can make considerably redundant computation and memory usage which can influence the efficiency especially in its LSTM module which precludes parallelization.…”

Section: B Results (mentioning)

confidence: 99%

“…As for the problem of arbitrary aspect ratios, recent methods with good performance take densely cropped image patches with fixed size as input [12], [13], [15], [20]. They also employ data augmentation somehow, but they suffered from the following three issues.…”

Section: Introduction (mentioning)

confidence: 99%

Patch Aggregator for Scene Text Script Identification

Cheng, Huang, Bai, et al. 2019

2019 International Conference on Document Analysis and Recognition (ICDAR)

Script identification in the wild is of great importance in a multi-lingual robust-reading system. The scripts deriving from the same language family share a large set of characters, which makes script identification a fine-grained classification problem. Most existing methods make efforts to learn a single representation that combines the local features by making a weighted average or other clustering methods, which may reduce the discriminatory power of some important parts in each script for the interference of redundant features. In this paper, we present a novel module named Patch Aggregator (PA), which learns a more discriminative representation for script identification by taking into account the prediction scores of local patches. Specifically, we design a CNN-based method consisting of a standard CNN classifier and a PA module. Experiments demonstrate that the proposed PA module brings significant performance improvements over the baseline CNN model, achieving the state-of-the-art results on three benchmark datasets for script identification: SIW-13, CVSI 2015 and RRC-MLT 2017.

“…They have also used Discrete Wavelet Transform (DWT) to reduce the dimension of the data. In [23], the authors have used a CNN-Long Short-Term Memory (LSTM) based framework with dynamic weighting for script recognition. From each image, patches are extracted which are fed to the CNN-LSTM combination.…”

Section: Related Study (mentioning)

confidence: 99%

A Hybrid Swarm and Gravitation-based feature selection algorithm for handwritten Indic script classification problem

Guha, Ghosh, Singh, et al. 2021

Complex Intell. Syst.

In any multi-script environment, handwritten script classification is an unavoidable pre-requisite before the document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, this complex pattern classification problem has been solved by researchers proposing various feature vectors mostly having large dimensions, thereby increasing the computation complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step to reduce the size of the feature vectors by restricting them only to the essential and relevant features. In the present work, we have addressed this issue by introducing a new FS algorithm, called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied over three feature vectors introduced in the literature recently—Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG), and Modified log-Gabor (MLG) filter Transform. Three state-of-the-art classifiers, namely, Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, consisting of officially recognized 12 Indic scripts, are prepared for experimentation. An average improvement in the range of 2–5% is achieved in the classification accuracy by utilizing only about 75–80% of the original feature vectors on all three datasets. The proposed method also shows better performance when compared to some popularly used FS models. The codes used for implementing HSGFS can be found in the following Github link: https://github.com/Ritam-Guha/HSGFS.
