Hierarchical Dynamic Parsing and Encoding for Action Recognition

Su, Bing; Zhou, Jiahuan; Ding, Xiaoqing; Wang, Hao; Wu, Ying

doi:10.1007/978-3-319-46493-0_13

Cited by 21 publications

(13 citation statements)

References 32 publications

“…Rank Pooling + IDT-FV [15] 66 Algorithm mAP(%) Interaction Part Mining [60] 72.4 Video Darwin [17] 72.0 Hier. Mid-Level Actions [45] 66.8 PCNN + IDT-FV [8] 71.4 GRP [6] 68.4 GRP + IDT-FV [6] 75.5 BRKP 66.3 IBKRP 68.7 IBKRP + IDT-FV 71.8 KRP-FS 70.0 KRP-FS + IDT-FV 76.1 Table 10. MPII Cooking Activities (7 splits) Algorithm Avg.…”

Section: Algorithm (mentioning)

confidence: 99%

Non-linear Temporal Subspace Representations for Activity Recognition

Cherian, Sra, Gould, et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characterized by their variations in time. As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of data onto which captures their temporal order. We develop this idea further and show that such a pooling scheme can be cast as an order-constrained kernelized PCA objective. We then propose to use the parameters of a kernelized low-rank feature subspace as the representation of the sequences. We cast our formulation as an optimization problem on generalized Grassmann manifolds and then solve it efficiently using Riemannian optimization techniques. We present experiments on several action recognition datasets using diverse feature modalities and demonstrate state-of-the-art results.

“…Tran et al (2015) treated videos as cubes and performed convolutions and pooling with 3D kernels. Recent methods (Li et al, 2016;Zhu et al, 2016;Wang and Hoai, 2016;Zhang et al, 2016;Su et al, 2016;Wang et al, 2016a) emphasize on action recognition in large scale videos where the background context is also taken into account. Shahroudy et al (2016b) divided the actions into body parts and proposed a multimodal-multipart learning method to represent their dynamics and appearances.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Human Pose Models from Synthesized Data for Robust RGB-D Action Recognition

et al. 2019

We propose Human Pose Models that represent RGB and depth images of human poses independent of clothing textures, backgrounds, lighting conditions, body shapes and camera viewpoints. Learning such universal models requires training images where all factors are varied for every human pose. Capturing such data is prohibitively expensive. Therefore, we develop a framework for synthesizing the training data. First, we learn representative human poses from a large corpus of real motion captured human skeleton data. Next, we fit synthetic 3D humans with different body shapes to each pose and render each from 180 camera viewpoints while randomly varying the clothing textures, background and lighting. Generative Adversarial Networks are employed to minimize the gap between synthetic and real image distributions. CNN models are then learned that transfer human poses to a shared high-level invariant space. The learned CNN models are then used as invariant feature extractors from real RGB and depth frames of human action videos and the temporal variations are modelled by Fourier Temporal Pyramid. Finally, linear SVM is used for classification. Experiments on three benchmark cross-view human action datasets show that our algorithm outperforms existing methods by significant margins for RGB only and RGB-D action recognition.

“…On the representation learning front of our contribution, there are a few prior pooling schemes that are similar in the sense that they also use the parameters of an optimization functional as a representation. The most related work is rankpooling and its variants [22,21,20,47,4,11,53] that use a rank-SVM for capturing the video temporal evolution. Similar to ours, Cherian et al [10] propose to use a subspace to represent video sequences.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Discriminative Video Representations Using Adversarial Perturbations

Wang, Cherian 2018

Lecture Notes in Computer Science

Adversarial perturbations are noise-like patterns that can subtly change the data, while failing an otherwise accurate classifier. In this paper, we propose to use such perturbations for improving the robustness of video representations. To this end, given a well-trained deep model for per-frame video recognition, we first generate adversarial noise adapted to this model. Using the original data features from the full video sequence and their perturbed counterparts, as two separate bags, we develop a binary classification problem that learns a set of discriminative hyperplanes, as a subspace, that will separate the two bags from each other. This subspace is then used as a descriptor for the video, dubbed discriminative subspace pooling. As the perturbed features belong to data classes that are likely to be confused with the original features, the discriminative subspace will characterize parts of the feature space that are more representative of the original data, and thus may provide robust video representations. To learn such descriptors, we formulate a subspace learning objective on the Stiefel manifold and resort to Riemannian optimization methods for solving it efficiently. We provide experiments on several video datasets and demonstrate state-of-the-art results.
