2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.172

Generalized Rank Pooling for Activity Recognition

Abstract: Most popular deep models for action recognition split video sequences into short sub-sequences consisting of a few frames; frame-based features are then pooled for recognizing the activity. Usually, this pooling step discards the temporal order of the frames, which could otherwise be used for better recognition. Towards this end, we propose a novel pooling method, generalized rank pooling (GRP), that takes as input features from the intermediate layers of a CNN that is trained on tiny sub-sequences, and produ…
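The abstract only sketches the method, but the core idea — pool per-frame CNN features through an orthonormal subspace whose projections respect the frames' temporal order — can be illustrated with a rough stand-in. The NumPy sketch below is not the paper's optimization (GRP solves a Riemannian problem that fits the subspace jointly with temporal-order constraints); the function name, the PCA subspace, and the order check are all illustrative choices made here.

```python
import numpy as np

def grp_like_descriptor(frames, k=5):
    """Rough stand-in for the subspace-pooling idea behind GRP.

    frames : (T, d) array of per-frame CNN features, in temporal order.
    k      : subspace dimensionality.

    GRP itself jointly optimizes an orthonormal subspace under
    temporal-order constraints on the projections; here the subspace is
    simply the top-k PCA basis, and we only report how often the
    projected energies ||U^T x_t||^2 already increase with t.
    """
    X = frames - frames.mean(axis=0, keepdims=True)
    # Orthonormal basis spanning the top-k principal directions.
    U = np.linalg.svd(X, full_matrices=False)[2][:k].T       # shape (d, k)
    energies = np.sum((frames @ U) ** 2, axis=1)             # ||U^T x_t||^2 per frame
    order_preserved = float(np.mean(np.diff(energies) > 0))  # crude temporal-order check
    return U, order_preserved
```

In this simplified picture the subspace U plays the role of the pooled video descriptor; the order-preservation score only signals how far such a PCA basis is from satisfying the temporal constraints that GRP enforces explicitly.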

Cited by 80 publications (92 citation statements)
References 48 publications (106 reference statements)
“…In the context of video data, the temporal structure of video data has been exploited to fine-tune networks on training data without labels [34, 2] (https://github.com/annusha/unsup_temp_embed). The temporal ordering of video frames has also been used to learn feature representations for action recognition [20, 23, 9, 4]. Lee et al. [20] learn a video representation in an unsupervised manner by solving a sequence sorting problem.…”
Section: Related Work
confidence: 99%
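The sequence-sorting task of Lee et al. [20] mentioned above trains a network to recognise the order of a shuffled tuple of frames. A hedged sketch of how such self-supervised training pairs could be generated follows; the sampling strategy and names are illustrative, not the paper's exact pipeline.

```python
import itertools
import numpy as np

def make_sorting_sample(video, n_frames=4, rng=np.random):
    """Build one training example for a sequence-sorting pretext task.

    video : (T, H, W, C) array of frames in temporal order.
    Returns (shuffled_frames, label), where label indexes one of the
    n_frames! possible orderings the network must recognise.
    """
    T = video.shape[0]
    idx = np.sort(rng.choice(T, size=n_frames, replace=False))  # ordered frame indices
    perms = list(itertools.permutations(range(n_frames)))
    label = rng.randint(len(perms))
    shuffled = video[idx][list(perms[label])]                    # apply the chosen permutation
    return shuffled, label
```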
“…In the context of video data, the temporal structure of video data has been exploited to fine-tune networks on train-1 https://github.com/annusha/unsup_temp_embed ing data without labels [34,2]. The temporal ordering of video frames has also been used to learn feature representations for action recognition [20,23,9,4]. Lee et al [20] learn a video representation in an unsupervised manner by solving a sequence sorting problem.…”
Section: Related Workmentioning
confidence: 99%
“…Enforcing such subspace constraints (orthonormality) on these hyperplanes is often empirically seen to demonstrate better performance, as is also observed in [10]. The operator ⊙ is the element-wise multiplication, and the quantity max(y(θ) ⊙ W^⊤θ) captures the maximum value of the element-wise multiplication, signifying that if at least one hyperplane classifies θ correctly, then the hinge loss will be zero.…”
Section: Discriminative Subspace Pooling
confidence: 88%
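The quoted hinge loss can be written out concretely. In the NumPy sketch below the unit margin and the function name are assumptions, not taken from the cited paper; it only shows that the loss vanishes as soon as any one hyperplane scores θ on the correct side with sufficient margin.

```python
import numpy as np

def multi_hyperplane_hinge(theta, W, y):
    """Hinge loss that is zero if at least one of the k hyperplanes in W
    classifies theta correctly (with an assumed margin of 1).

    theta : (d,) feature vector
    W     : (d, k) stack of hyperplanes (columns, roughly orthonormal)
    y     : (k,) labels in {-1, +1}, one per hyperplane response
    """
    scores = y * (W.T @ theta)           # element-wise y(theta) ⊙ W^⊤ theta
    return max(0.0, 1.0 - np.max(scores))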
“…While better CNN architectures, such as the recent I3D framework [8], are essential for pushing the state of the art on video tasks, it is also important to have efficient representation learning schemes that can capture the long-term temporal video dynamics from predictions generated by a temporally local model. Recent efforts in this direction, such as rank pooling, temporal segment networks and temporal relation networks [55, 10, 22, 18, 21, 5, 45], aim to incorporate temporal dynamics over clip-level features. However, such models often ignore the noise in the videos, and use representations that adhere to a plausible criterion.…”
Section: Introduction
confidence: 99%
“…In Bilen et al. [14], rank pooling is extended towards an early frame-level fusion, dubbed dynamic images; Wang et al. [56] extend this idea to use optical flow, which they call dynamic flow representation. Cherian et al. [17] generalized rank pooling to include multiple hyperplanes as a subspace, enabling a richer characterization of the spatio-temporal details of the video. This idea was further extended to non-linear feature representations via kernelized rank pooling in [16].…”
Section: Video Representation Using Pooling Schemes
confidence: 99%
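For reference, the rank pooling that [14] turns into dynamic images and [17] generalizes to subspaces summarizes a video by the parameters of a linear map trained to preserve frame order. The sketch below replaces the original ranking-SVM fit of Fernando et al. with a ridge regression onto the frame index — a common simplification used here for brevity, not the formulation of these papers.

```python
import numpy as np

def rank_pool(frames, lam=1.0):
    """Simplified rank pooling: return parameters w of a linear map that
    tries to order the frames in time; w serves as the video descriptor.

    frames : (T, d) per-frame features in temporal order.
    lam    : ridge regularization strength.
    """
    T, d = frames.shape
    # Time-varying mean ("smoothed" features), as in rank pooling.
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    # Ridge regression surrogate: w = (V^T V + lam I)^{-1} V^T t
    w = np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ t)
    return w
```

Applying such a pooling directly to raw pixels (rather than CNN features) yields the single-image summaries that [14] calls dynamic images.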