“…Similar to [2], Feichtenhofer et al. [15] showed that two-stream fusion at an intermediate layer, using RGB images and a stack of ten optical flow frames, can improve performance with fewer parameters. Extensions of two-stream networks include the original Two-stream ConvNet [2], Two-stream ConvPooling [44], TDD+FV [13], Two-stream Transformations [51], Two-stream ResNet [15], TSN (3 modalities) [14], KVMF [52], ST-ResNet [15], AdaScan [53], Three-stream sDTD [54], ST-VLMPF [17], SPN (BN-Inception) [55], and ActionVLAD [56]. Despite the good performance of multi-stream frameworks, it remains unclear whether deep learning based models can capture subtle motion cues and long-term motion dynamics well enough without multi-stream fusion.…”
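To make the intermediate-fusion idea concrete, the following is a minimal NumPy sketch, not the architecture of [15]: a spatial stream receives one RGB frame, a temporal stream receives ten stacked optical-flow frames (10 × 2 channels for horizontal and vertical flow), and their intermediate feature maps are fused by channel concatenation so a single trunk can continue from there. All shapes, the `conv_stub` projection, and the random weights are illustrative assumptions.

```python
import numpy as np

def conv_stub(x, out_channels, seed):
    """Stand-in for a learned conv layer: a 1x1 projection with random weights.

    x: array of shape (channels, height, width); illustrative only.
    """
    rng = np.random.default_rng(seed)
    c_in = x.shape[0]
    w = rng.standard_normal((out_channels, c_in)) * 0.01
    # 1x1 convolution == per-pixel linear map over channels
    return np.einsum("oc,chw->ohw", w, x)

rgb = np.random.rand(3, 56, 56)    # one RGB frame (hypothetical size)
flow = np.random.rand(20, 56, 56)  # 10 flow frames x (u, v) components

spatial_feat = conv_stub(rgb, 64, seed=0)    # intermediate spatial features
temporal_feat = conv_stub(flow, 64, seed=1)  # intermediate temporal features

# Intermediate fusion: concatenate feature maps along the channel axis.
# After this point only one (fused) trunk is needed, which is why
# mid-level fusion can use fewer parameters than two full streams
# merged only at the final scores.
fused = np.concatenate([spatial_feat, temporal_feat], axis=0)
print(fused.shape)  # (128, 56, 56)
```

Late fusion, by contrast, would run both streams to completion and average their class scores, duplicating all higher-layer parameters.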