A Bag of Expression framework for improved human action recognition

Mathematical Problems in Engineering

Irtaza

et al. 2019

Self Cite

Human action recognition has the potential to predict the activities of an instructor within the lecture room. Evaluation of lecture delivery can help teachers analyze shortcomings and plan lectures more effectively. However, manual or peer evaluation is time-consuming, tedious and sometimes it is difficult to remember all the details of the lecture. Therefore, automation of lecture delivery evaluation significantly improves teaching style. In this paper, we propose a feedforward learning model for instructor’s activity recognition in the lecture room. The proposed scheme represents a video sequence in the form of a single frame to capture the motion profile of the instructor by observing the spatiotemporal relation within the video frames. First, we segment the instructor silhouettes from input videos using graph-cut segmentation and generate a motion profile. These motion profiles are centered by obtaining the largest connected components and normalized. Then, these motion profiles are represented in the form of feature maps by a deep convolutional neural network. Then, an extreme learning machine (ELM) classifier is trained over the obtained feature representations to recognize eight different activities of the instructor within the classroom. For the evaluation of the proposed method, we created an instructor activity video (IAVID-1) dataset and compared our method against different state-of-the-art activity recognition methods. Furthermore, two standard datasets, MuHAVI and IXMAS, were also considered for the evaluation of the proposed scheme.

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 92%

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 99%

See 1 more Smart Citation

Instructor Activity Recognition through Deep Spatiotemporal Features and Feedforward Extreme Learning Machines

Nida

Mathematical Problems in Engineering

Irtaza

et al. 2019

Self Cite

“…Two different techniques have been proposed for dictionary building in [25]: modular dictionary and single dictionary. In [26,27] Nazir et al proposed the dynamic spatio-temporal bag of expressions (D-STBoE) model and the BoE framework for action recognition which improves the existing strength of bag of words. A global feature ensemble representation is discussed by Chen et al [18] who combined the HOG vehicle features extracted in a grid-based pattern.…”

Section: Introductionmentioning

confidence: 99%

Vehicle Make and Model Recognition using Bag of Expressions

Jamil

Hussain

et al. 2020

Sensors

Self Cite

Vehicle make and model recognition (VMMR) is a key task for automated vehicular surveillance (AVS) and various intelligent transport system (ITS) applications. In this paper, we propose and study the suitability of the bag of expressions (BoE) approach for VMMR-based applications. The method includes neighborhood information in addition to visual words. BoE improves the existing power of a bag of words (BOW) approach, including occlusion handling, scale invariance and view independence. The proposed approach extracts features using a combination of different keypoint detectors and a Histogram of Oriented Gradients (HOG) descriptor. An optimized dictionary of expressions is formed using visual words acquired through k-means clustering. The histogram of expressions is created by computing the occurrences of each expression in the image. For classification, multiclass linear support vector machines (SVM) are trained over the BoE-based features representation. The approach has been evaluated by applying cross-validation tests on the publicly available National Taiwan Ocean University-Make and Model Recognition (NTOU-MMR) dataset, and experimental results show that it outperforms recent approaches for VMMR. With multiclass linear SVM classification, promising average accuracy and processing speed are obtained using a combination of keypoint detectors with HOG-based BoE description, making it applicable to real-time VMMR systems.

“…However, since BoVW approaches only consider the frequency of each visual word and ignore the spatio-temporal relation among visual words, such representation is not able to exploit the contextual relationship between visual words. To overcome this drawback, we propose an extension of our previously proposed Bag of Expressions (BoE) model [7], which incorporates contextual relationships of visual words while preserving inherent qualities of the classical BoVW approach. The main idea of visual expression formation is to store the spatio-temporal contextual information that is usually lost during formation of visual words by encoding neighboring spatio-temporal interest points’ information.…”

Section: Introductionmentioning

confidence: 99%

Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition

Nazir

Nebel

et al. 2019

Sensors

Self Cite

Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach along with its variations has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube of a visual word. To handle inter-class variation, we use class-specific visual word representation for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, as constructing neighborhoods with a fixed number of neighbors could include non-relevant information making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. Thus, the proposed approach makes the model more robust to occlusion and changing viewpoint challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) for classifying bag of expressions into action classes. Comprehensive experiments on four publicly available datasets: KTH, UCF Sports, UCF11 and UCF50 show that the proposed model outperforms existing state-of-the-art human action recognition methods in term of accuracy to 99.21%, 98.60%, 96.94 and 94.10%, respectively.