2019
DOI: 10.1109/access.2019.2910604
A Spatiotemporal Heterogeneous Two-Stream Network for Action Recognition

Abstract: Methods based on two-stream networks have achieved great success in video action recognition. However, most existing methods employ the same structure for both the spatial and temporal networks, leading to unsatisfactory performance. In this paper, we propose a spatiotemporal heterogeneous two-stream network, which employs two different network structures for spatial and temporal information, respectively. Specifically, the Residual Network (ResNet) and BN-Inception are utilized as the base networks to present …
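The abstract describes a two-stream design in which differently structured backbones handle the RGB (spatial) and optical-flow (temporal) inputs, with their class scores combined at the end. A minimal NumPy sketch of that score-fusion step is below; the class count, logit values, and equal fusion weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over class scores.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5, w_temporal=0.5):
    """Late-fuse per-class scores from a spatial backbone (e.g. ResNet on RGB
    frames) and a temporal backbone (e.g. BN-Inception on stacked optical flow).
    The equal weights here are hypothetical; the paper's fusion may differ."""
    fused = w_spatial * softmax(spatial_logits) + w_temporal * softmax(temporal_logits)
    return int(np.argmax(fused)), fused

# Toy example with 4 action classes.
spatial = np.array([2.0, 0.5, 0.1, -1.0])   # scores from the RGB stream
temporal = np.array([0.3, 2.5, 0.0, -0.5])  # scores from the flow stream
pred, scores = fuse_two_stream(spatial, temporal)
```

Because the two streams see complementary evidence, the fused prediction can differ from either stream alone, which is the usual motivation for late fusion in two-stream models.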

Cited by 29 publications (27 citation statements)
References 34 publications (55 reference statements)
“…Huang et al. [19] introduced an Optical Flow guided Feature (OFF), which can replace optical flow to quickly extract robust temporal information with a convolutional neural network. Chen et al. [20] proposed a spatiotemporal heterogeneous two-stream network, which employs two different network structures for spatial and temporal information.…”
Section: A. Video Action Recognition
confidence: 99%
“…Comparison with current state-of-the-art methods on UCF101 and HMDB51.

Model                                UCF101   HMDB51
IDT [2]                              85.9%    57.2%
C3D + IDT [8]                        90.4%    -
P3D ResNet [22]                      93.7%    -
Two-stream [7]                       88.0%    59.4%
Two-Stream TSN (BN-Inception) [15]   94.0%    68.5%
Two-Stream TSN (DenseNet169) [15]    93.3%    68.3%
C²LSTM [24]                          92.8%    61.3%
L²LSTM [25]                          93.6%    66.2%
HR-MSCNN + IDT [13]                  94.5%    69.8%
STDDCN [16]                          93.8%    66.9%
MLDF-3D [18]                         93.5%    68.6%
Two-Stream Heterogeneity [20]        94.4%    67.2%
STRN [26]                            93.2%    64.9%
Ours (DenseNet169)                   94.2%    70.3%
Ours (DenseNet201)                   94.6%    70.9%…”
Section: Comparison With the State-of-the-Art
confidence: 99%
“…
Method                                                UCF101   HMDB51
IDT [53]                                              86.40%   61.70%
Spatiotemporal ConvNet [8]                            65.40%   -
Long-term recurrent ConvNet [54]                      82.90%   -
Composite LSTM Model [55]                             84.30%   44.00%
Two-Stream ConvNet [17]                               88.00%   59.40%
P3D ResNets (Without IDT) [7]                         88.60%   -
Two-Stream+LSTM [56]                                  88.60%   -
C3D [42]                                              85.20%   -
Res3D [57]                                            85.80%   54.90%
Dynamic Image Networks [58]                           76.90%   42.80%
Dynamic Image Networks + IDT [58]                     89.10%   65.20%
Asymmetric 3D-CNN (RGB+RGBF+IDT) [59]                 92.60%   65.40%
T3D [60]                                              93.20%   63.50%
TDD+IDT [61]                                          91.50%   65.90%
Conv Fusion (Without IDT) [47]                        92.50%   65.40%
Transformations [51]                                  92.40%   62.00%
VideoLSTM + IDT [62]                                  92.20%   64.90%
Hierarchical Attention Networks [63]                  92.70%   64.30%
Spatiotemporal Multiplier ConvNet [19]                94.20%   68.90%
Sequential Learning Framework [64]                    90.90%   65.70%
T-ResNets (Without IDT) [16]                          93.90%   67.20%
TSN (2 modalities) [65]                               94.00%   68.50%
Spatiotemporal Heterogeneous Two-stream Network [66]  94.40%   67.20%…”
Section: UCF101 HMDB51
confidence: 99%
“…Currently, several methods [9], [10], [28], [30] rely on deep learning to perform video-based action recognition. For example, Gkioxari et al. [9] proposed to use R*CNN for context-based action recognition.…”
Section: Introduction
confidence: 99%
“…Peng and Schmid [10] developed a multi-region two-stream R-CNN method for action detection. Chen et al. [28] proposed a spatiotemporal heterogeneous two-stream network for video action recognition, which employs two different network structures for spatial and temporal information, respectively. Tang et al. [30] proposed a Semantics Preserving Teacher-Student (SPTS) network architecture, which is applied to the action segmentation task.…”
Section: Introduction
confidence: 99%