2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.316

Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

Abstract: General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance…
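The chained integration described in the abstract can be pictured as successive refinement: each cue's features are concatenated with the previous stage's features and re-classified, so later streams condition on earlier ones. Below is a minimal sketch of that idea in PyTorch; the encoders, layer sizes, and class count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ChainedStreams(nn.Module):
    """Illustrative sketch of Markov-chain-style cue integration:
    pose -> motion -> appearance, each stage refining the prediction.
    All dimensions are arbitrary placeholders, not the paper's design."""

    def __init__(self, feat_dim=256, num_classes=21):
        super().__init__()
        # One encoder per cue; in the paper these are deep CNN streams.
        self.pose_enc = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        self.motion_enc = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        self.rgb_enc = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        # Each stage sees its own cue plus all earlier stages' features.
        self.head1 = nn.Linear(feat_dim, num_classes)
        self.head2 = nn.Linear(2 * feat_dim, num_classes)
        self.head3 = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, pose, motion, rgb):
        h1 = self.pose_enc(pose)
        h2 = torch.cat([h1, self.motion_enc(motion)], dim=-1)
        h3 = torch.cat([h2, self.rgb_enc(rgb)], dim=-1)
        # Per-stage predictions permit intermediate losses during training.
        return self.head1(h1), self.head2(h2), self.head3(h3)
```

In the paper each stage is a full CNN stream; linear encoders stand in here only to keep the sketch short.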

Cited by 205 publications (146 citation statements: 2 supporting, 144 mentioning, 0 contrasting) · References 48 publications
“…This result suggests that high-precision pose estimation is not essential within the STAR framework for the purpose of action recognition. This is in contrast to other pose-based approaches that saw significant performance improvements using ground-truth keypoint locations [33], [45], [46]. The inference speeds of each end-to-end architecture are also reported in Table I.…”
Section: B. STAR-Net Results (mentioning)
confidence: 96%
“…On UTD-MHAD, combining the models decreased performance. The results in Table II also show that STAR-Net outperforms several methods using richer data modalities, but surrenders 2.8% to the state-of-the-art pose-based method of Liu et al. [32], who used spatial rank pooling to encode the evolution of 2D pose images and averaged pose heatmaps (i.e., as separate streams). Interestingly, the performance of each stream alone was 85.6% and 74.9%, respectively, indicating that these streams were highly complementary.…”
Method accuracies from the same comparison table: [33] 57.0 · Chéron et al. (P-CNN) [45] 61.1 · Gkioxari et al. (Action Tubes) [49] 62.5 · Peng et al. (MR-TS R-CNN) [10] 71.1 · Zolfaghari et al. (Chained) [46] 76.1 · STAR-Net 64.3
Section: Comparison With the State-of-the-Art (mentioning)
confidence: 99%
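Rank pooling, which the quote credits to Liu et al. [32], summarizes a time-ordered feature sequence by the parameters of a linear model fit to the sequence's temporal order; those parameters become the video-level descriptor. A minimal sketch using a least-squares stand-in for the usual ranking objective (the simplification and all names are assumptions, not Liu et al.'s implementation):

```python
import numpy as np

def rank_pool(features):
    """Encode the temporal evolution of a (T, D) feature sequence as one
    D-dimensional vector: the weights of a linear map from feature to
    time index. Least-squares surrogate for the ranking-SVM objective."""
    T, _ = features.shape
    t = np.arange(1, T + 1, dtype=np.float64)  # time indices as targets
    # Solve min_w ||features @ w - t||^2; w captures how features evolve.
    w, *_ = np.linalg.lstsq(features, t, rcond=None)
    return w

# Toy usage: 30 frames of 64-dim pose-image features with temporal drift.
seq = np.cumsum(np.random.randn(30, 64), axis=0)
print(rank_pool(seq).shape)  # (64,)
```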
“…In addition, since DD-Net employs one-dimensional CNNs to extract the feature, it is much faster than other models that use RNNs [31], [22], [32], [25] or 2D/3D CNNs [5], [39], [7], [8], [28]. During inference, DD-Net's speed can reach around 3,500 FPS on one GPU (i.e., a GTX 1080Ti), or 2,000 FPS on one CPU (i.e., an Intel E5-2620).…”
Section: Results Analysis and Discussion (mentioning)
confidence: 99%
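The FPS figures follow from DD-Net replacing recurrent and 2D/3D convolutional layers with cheap 1D convolutions along the time axis. A rough sketch of how one might reproduce such a throughput measurement; the layer configuration is a placeholder guess, not DD-Net's actual design:

```python
import time
import torch
import torch.nn as nn

# Tiny 1D CNN over a skeleton sequence: (batch, per-frame features, T).
model = nn.Sequential(
    nn.Conv1d(50, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 27),
).eval()

x = torch.randn(1, 50, 32)  # one 32-frame clip, 50 per-frame features
with torch.no_grad():
    for _ in range(10):      # warm-up iterations
        model(x)
    start = time.perf_counter()
    n = 200
    for _ in range(n):
        model(x)
    fps = n / (time.perf_counter() - start)
print(f"~{fps:.0f} clips/s on this machine")
```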
“…For skeleton-based action recognition, two types of input features are commonly used: the geometric feature [18], [22] and the Cartesian coordinate feature [31], [32], [34], [6], [7]. The Cartesian coordinate feature is not invariant to location and viewpoint.…”
Section: A. Modeling Location-Viewpoint Invariant Feature by Joint Co… (mentioning)
confidence: 99%
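The invariance contrast in the quote can be checked directly: pairwise joint distances, a typical geometric feature, are unchanged when a skeleton is translated or rotated, while its raw Cartesian coordinates are not. A small NumPy sketch (the joint count and data are arbitrary choices):

```python
import numpy as np

def pairwise_joint_distances(skeleton):
    """skeleton: (J, 3) Cartesian joint coordinates.
    Returns the upper-triangular pairwise distances, a feature invariant
    to where the subject stands and how the camera is rotated."""
    diff = skeleton[:, None, :] - skeleton[None, :, :]  # (J, J, 3)
    dist = np.linalg.norm(diff, axis=-1)                # (J, J)
    iu = np.triu_indices(len(skeleton), k=1)
    return dist[iu]

rng = np.random.default_rng(0)
skel = rng.normal(size=(15, 3))
# Random rotation (QR yields an orthonormal matrix) plus a translation.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = skel @ q.T + np.array([2.0, -1.0, 0.5])
print(np.allclose(pairwise_joint_distances(skel),
                  pairwise_joint_distances(moved)))  # True
```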
“…Liu et al. [44] jointly learned the regression and classification network with multi-modal data for action detection. Though the methods in [14], [15], [44], [45] illustrate that introducing multiple cues improves performance for action analytics, they are limited by strict data requirements. [15] requires aligned skeletons and depth data in training.…”
Section: B. Skeleton-Based Action Recognition (mentioning)
confidence: 99%