2012 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2012.6247807

Discovering discriminative action parts from mid-level video representations

Abstract: We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model that incorporates appearance and motion constraints for the individual parts and pairwise constraints for the spatio-temporal dependencies among them. During training, we estimate the model parameters discriminatively…

Cited by 211 publications (239 citation statements)
References 29 publications
“…Figure 1), in order to build a hierarchical model of the motion content of a video. This is in contrast to existing approaches [39] that view videos as a bag of clusters. We introduce a corresponding tree representation of actions, called BOF-tree.…”
Section: Introduction (contrasting)
confidence: 41%
“…Wang and Mori [50] use tracking and a Hidden Conditional Random Field (HCRF) to learn a discriminative model of latent parts for frame-by-frame recognition. Closest to our work, Raptis et al [39] extract clusters of long-term trajectories and learn a latent model over a fixed number of parts. Their approach has a cubic time complexity in the number of trajectories, relies on bounding box annotations, and uses only a fixed small subset of clusters for all videos.…”
Section: Introduction (mentioning)
confidence: 99%
“…These motion features are then used in an SVM for performing action recognition. Raptis et al [36] discuss another trajectory clustering approach for obtaining mid-level descriptions of the most significant action components useful for recognition. However, we compare our results with our feature detectors in a bag-of-features SVM framework.…”
Section: Related Work (mentioning)
confidence: 99%
“…Inspired from the work of [22] and based on a few weak annotations on a sparse set of frames, shown in Figure 8, two types of poselet features, including the HOG descriptors and the BoW features, are used for training the poselet detector. The BoW features, quantized dense descriptors (SIFT [43], histogram of optical flow (HOF) [44], and motion boundaries (HoMB) [45]), are used to augment the HOG descriptors for capturing the motion information of poselets. In this paper, the background information is removed from the poselets by the segmentation scheme, which, in turn, improves both the quality of the poselet models in the learning phase and the recognition accuracy in the testing phase.…”
Section: The Application To Human Activity Recognition (mentioning)
confidence: 99%