2019
DOI: 10.3390/s19122790
Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition

Abstract: Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach along with its variations has been widely used. In this paper, we propose a Dynami…

Cited by 11 publications (2 citation statements) · References 63 publications (72 reference statements)
“…The key step in video action recognition is extracting the effective spatiotemporal features where the spatial feature is mainly used to describe the global scene configuration and the appearance of objects in a single frame of the video, while the temporal feature is extracted to represent motion cues among multiple frames over time. In recent years, many video action recognition methods have been proposed, which can be mainly divided into two categories [7]: hand-crafted feature-based action recognition [8,9], and deep learning network-based action recognition [10,11]. Hand-crafted feature-based methods usually detect key spatiotemporal points in the video and then represent these points with local descriptors, while deep learning-based methods utilize multilayers to automatically and progressively extract high-level features from raw input.…”
Section: Introduction
confidence: 99%
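The statement above describes the classic hand-crafted pipeline: detect spatiotemporal interest points in a video, then represent each point with a local descriptor. As a rough illustration only (not the cited paper's method), the sketch below detects motion "interest points" from temporal difference energy on a synthetic clip and describes each with a flattened local patch; the detector, patch size, and threshold are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
video = rng.random((10, 32, 32))  # toy clip: (frames, height, width)

# Temporal energy: summed squared frame-to-frame difference highlights motion
energy = np.square(np.diff(video, axis=0)).sum(axis=0)

# Keep the strongest responses as candidate spatiotemporal interest points
thresh = np.percentile(energy, 95)
ys, xs = np.where(energy > thresh)

# Describe each point by its flattened local patch (a stand-in for a real
# descriptor such as HOG/HOF used in the literature)
half = 2
descriptors = []
for y, x in zip(ys, xs):
    if half <= y < 32 - half and half <= x < 32 - half:
        patch = energy[y - half:y + half + 1, x - half:x + half + 1]
        descriptors.append(patch.ravel())
descriptors = np.array(descriptors)  # shape: (num_points, 25)
```

In a real system the detector would be a 3D Harris or cuboid detector and the descriptors HOG/HOF or similar; the structure of the pipeline, however, is the same: detect, then describe.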
“…Two different techniques have been proposed for dictionary building in [25]: modular dictionary and single dictionary. In [26,27] Nazir et al proposed the dynamic spatio-temporal bag of expressions (D-STBoE) model and the BoE framework for action recognition which improves the existing strength of bag of words. A global feature ensemble representation is discussed by Chen et al [18] who combined the HOG vehicle features extracted in a grid-based pattern.…”
Section: Introduction
confidence: 99%
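The bag-of-words / bag-of-expressions family referenced above builds a dictionary (codebook) of visual words from local descriptors and then encodes each video as a histogram over that dictionary. The following is a minimal generic sketch of that idea using a tiny k-means on synthetic descriptors; the codebook size and data are illustrative assumptions, not the D-STBoE construction itself.

```python
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 32))  # toy local descriptors

# Build a visual-word codebook with a few iterations of plain k-means
k = 8
codebook = descriptors[rng.choice(len(descriptors), k, replace=False)]
for _ in range(10):
    # Assign each descriptor to its nearest codebook entry
    dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    labels = dists.argmin(axis=1)
    # Update each non-empty cluster center to the mean of its members
    for j in range(k):
        if np.any(labels == j):
            codebook[j] = descriptors[labels == j].mean(axis=0)

# Encode the "video" as a normalized histogram over visual words
hist = np.bincount(labels, minlength=k).astype(float)
hist /= hist.sum()
```

The resulting fixed-length histogram is what a classifier (e.g., an SVM) consumes; models such as D-STBoE extend this baseline with richer spatio-temporal "expressions" in place of single words.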