2019
DOI: 10.1109/access.2019.2910604
A Spatiotemporal Heterogeneous Two-Stream Network for Action Recognition

Abstract: Methods based on two-stream networks have achieved great success in video action recognition. However, most existing methods employ the same structure for both the spatial and temporal networks, leading to unsatisfactory performance. In this paper, we propose a spatiotemporal heterogeneous two-stream network, which employs two different network structures for spatial and temporal information, respectively. Specifically, the Residual Network (ResNet) and BN-Inception are utilized as the base networks to present …
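The abstract describes a two-stream design in which differently structured backbones handle the RGB (spatial) and optical-flow (temporal) inputs, with their class scores combined at the end. A minimal NumPy sketch of that score-fusion step is below; the class count, logit values, and equal fusion weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over class scores.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5, w_temporal=0.5):
    """Late-fuse per-class scores from a spatial backbone (e.g. ResNet on RGB
    frames) and a temporal backbone (e.g. BN-Inception on stacked optical flow).
    The equal weights here are hypothetical; the paper's fusion may differ."""
    fused = w_spatial * softmax(spatial_logits) + w_temporal * softmax(temporal_logits)
    return int(np.argmax(fused)), fused

# Toy example with 4 action classes.
spatial = np.array([2.0, 0.5, 0.1, -1.0])   # scores from the RGB stream
temporal = np.array([0.3, 2.5, 0.0, -0.5])  # scores from the flow stream
pred, scores = fuse_two_stream(spatial, temporal)
```

Because the two streams see complementary evidence, the fused prediction can differ from either stream alone, which is the usual motivation for late fusion in two-stream models.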

Cited by 29 publications (27 citation statements)
References 34 publications (55 reference statements)
“…Huang et al. [19] introduced an Optical Flow guided Feature (OFF), which can replace optical flow to quickly extract robust temporal information with a convolutional neural network. Chen et al. [20] proposed a spatiotemporal heterogeneous two-stream network, which employs two different network structures for spatial and temporal information.…”
Section: A. Video Action Recognition
confidence: 99%
“…Comparison with current state-of-the-art methods on UCF101 and HMDB51.

Model                                UCF101   HMDB51
IDT [2]                              85.9%    57.2%
C3D + IDT [8]                        90.4%    -
P3D ResNet [22]                      93.7%    -
Two-stream [7]                       88.0%    59.4%
Two-Stream TSN (BN-Inception) [15]   94.0%    68.5%
Two-Stream TSN (DenseNet169) [15]    93.3%    68.3%
C²LSTM [24]                          92.8%    61.3%
L²LSTM [25]                          93.6%    66.2%
HR-MSCNN + IDT [13]                  94.5%    69.8%
STDDCN [16]                          93.8%    66.9%
MLDF-3D [18]                         93.5%    68.6%
Two-Stream Heterogeneity [20]        94.4%    67.2%
STRN [26]                            93.2%    64.9%
Ours (DenseNet169)                   94.2%    70.3%
Ours (DenseNet201)                   94.6%    70.9%…”
Section: Comparison With the State-of-the-Art
confidence: 99%
“…
Method                                                UCF101   HMDB51
IDT [53]                                              86.40%   61.70%
Spatiotemporal ConvNet [8]                            65.40%   -
Long-term recurrent ConvNet [54]                      82.90%   -
Composite LSTM Model [55]                             84.30%   44.00%
Two-Stream ConvNet [17]                               88.00%   59.40%
P3D ResNets (Without IDT) [7]                         88.60%   -
Two-Stream+LSTM [56]                                  88.60%   -
C3D [42]                                              85.20%   -
Res3D [57]                                            85.80%   54.90%
Dynamic Image Networks [58]                           76.90%   42.80%
Dynamic Image Networks + IDT [58]                     89.10%   65.20%
Asymmetric 3D-CNN (RGB+RGBF+IDT) [59]                 92.60%   65.40%
T3D [60]                                              93.20%   63.50%
TDD+IDT [61]                                          91.50%   65.90%
Conv Fusion (Without IDT) [47]                        92.50%   65.40%
Transformations [51]                                  92.40%   62.00%
VideoLSTM + IDT [62]                                  92.20%   64.90%
Hierarchical Attention Networks [63]                  92.70%   64.30%
Spatiotemporal Multiplier ConvNet [19]                94.20%   68.90%
Sequential Learning Framework [64]                    90.90%   65.70%
T-ResNets (Without IDT) [16]                          93.90%   67.20%
TSN (2 modalities) [65]                               94.00%   68.50%
Spatiotemporal Heterogeneous Two-stream Network [66]  94.40%   67.20%…”
Section: UCF101 HMDB51
confidence: 99%
“…Currently, several methods [9], [10], [28], [30] rely on deep learning to perform video-based action recognition. For example, Gkioxari et al. [9] proposed to use R*CNN for context-based action recognition.…”
Section: Introduction
confidence: 99%
“…Peng and Schmid [10] developed a multi-region two-stream R-CNN method for action detection. Chen et al. [28] proposed a spatiotemporal heterogeneous two-stream network for video action recognition, which employs two different network structures for spatial and temporal information, respectively. Tang et al. [30] proposed a Semantics Preserving Teacher-Student (SPTS) network architecture, which is applied to the action segmentation task.…”
Section: Introduction
confidence: 99%