2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
DOI: 10.1109/cvpr.2017.341

Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos

citations
Cited by 51 publications
(34 citation statements)
references
References 43 publications
“…Some studies suggested various methods, such as support vector machines (SVM) [19,20], unsupervised learning [21], and multi-label learning [22], to improve recognition performance. In more recent research, a significant performance gain was achieved using deep ConvNets [1,2,3,23,24]. Simonyan et al. [4] proposed a two-stream architecture composed of a spatial and a temporal stream to capture appearance and motion features separately.…”
Section: Related Work (mentioning)
confidence: 99%
“…
Method | UCF-101 | HMDB-51
iDT [24] | 86.4 | 61.7
Two stream CNN [15] | 88.0 | 59.4
TDD [26] | 91.5 | 65.9
Long Term Convolution [22] | 91.7 | 64.8
Spatiotemporal Pyramid Network [28] | 94.6 | 68.9
Spatiotemporal Multiplier Network [6] | 94.2 | 68.9
Two stream TSN [27] | 94.0 | 68.5
ST-VLMPF [4] | 93.6 | 69.5
Two-Stream I3D [2] | 93.4 | 66.4
Lattice LSTM [18] | 93.6 | 66.2
Full OFF [19] | 96.0 | 74.2
Full IF-TTN | 96.2 | 74.8
C3D [20] | 82.3 | -
TSN(RGB) [27] | 85.7 | 51.0
TSN(RGB+RGB Difference) [27] | 91.0 | -
RGB+EMV-CNN | 86.4 | 53.0
CoViAR [29] | 90.4 | 59.1
real-time OFF [19] | 93.3 | -
MV-IF-TTN | 94.5 | 70.0
… while the lower part presents real-time methods. Notice that for non-real-time methods we assemble the optical flow and motion vectors based IF-TTN scores to make final predictions (denoted as Full IF-TTN).…”
Section: Comparison With the State of the Art (mentioning)
confidence: 99%
“…We compare our method with both traditional approaches, like iDT [24], and deep learning based methods, such as Two-Stream CNN [15], C3D [20], TSN [27], Temporal Deep convolutional Descriptors (TDD) [26], Long-term Temporal CNN [22], Spatiotemporal Pyramid Network [28], Spatiotemporal Multiplier Network [6], Spatiotemporal Vector of Locally Max Pooled Features (ST-VLMPF) [4], Lattice LSTM [18], Inflated 3D CNN (I3D) [2], and Optical Flow guided Features (OFF) [19]. Our full IF-TTN achieves state-of-the-art results on both datasets.…”
Section: Comparison With the State of the Art (mentioning)
confidence: 99%
“…Similar to [2], Feichtenhofer et al. [15] showed that a two-stream fusion at an intermediate layer using RGB images and a stack of ten optical flow frames can improve the performance with fewer parameters. Extensions of two-stream networks include Two-stream ConvNet (original) [2], Two-stream ConvPooling [44], TDD+FV [13], Two-stream Transformations [51], Two-stream ResNet [15], TSN (3 modalities) [14], KVMF [52], ST-ResNet [15], AdaScan [53], Three-stream sDTD [54], ST-VLMPF [17], SPN (BN-Inception) [55], and ActionVLAD [56]. Despite the good performance of multi-stream frameworks, it still remains unclear whether a deep learning based model can capture the subtle motion model and long-term motion dynamics for good performance without multi-stream fusion.…”
Section: Related Work (mentioning)
confidence: 99%
“…Action classification in video has been one of the most challenging problems next to image classification [10]. Recent deep learning approaches including 3D CNN [11], two-stream CNNs [2], C3D [12], TDD [13], TSN [14], ST-ResNet+iDT [15], L2STM [16], ST-VLMPF [17], P3D ResNet [18], I3D [19], 3D ResNeXt [20], R(2+1)D-TwoStream [7], CO2FI+ASYN [21], and DML [22] have shown state-of-the-art performance in action recognition. The development of CNNs with spatio-temporal 3D convolutional kernels (3D CNNs) is growing rapidly and contributes to significant advances in video recognition [7], [18]-[20] because 3D CNNs can be used to directly extract spatio-temporal features from raw videos.…”
Section: Introduction (mentioning)
confidence: 99%