2013 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2013.332

Representing Videos Using Mid-level Discriminative Patches

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches…
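The abstract describes patches selected for being both representative of their class and discriminative against other classes. A minimal sketch of that mining idea, not the authors' implementation, is given below: candidate patch descriptors are clustered, one linear SVM is trained per cluster against patches from other classes, and clusters are ranked by how strongly their detectors fire on held-out videos of the same class. Feature extraction is stubbed out with random vectors, and every function name, constant, and threshold here is an illustrative assumption.

    # Sketch of discriminative patch mining (illustrative only, not the paper's code).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def extract_patch_descriptors(num_patches, dim=256):
        """Stand-in for descriptors of sampled spatio-temporal patches (e.g. HOG3D)."""
        return rng.normal(size=(num_patches, dim))

    pos = extract_patch_descriptors(500)       # patches from videos of the target class
    neg = extract_patch_descriptors(2000)      # patches from all other classes
    held_out = extract_patch_descriptors(300)  # patches from unseen videos of the target class

    # 1) Group visually similar candidate patches.
    clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pos)

    # 2) Train one linear detector per cluster (cluster members vs. the negative pool).
    detectors, scores = [], []
    for k in range(clusters.n_clusters):
        members = pos[clusters.labels_ == k]
        if len(members) < 5:                   # skip clusters too small to train on
            detectors.append(None)
            scores.append(-np.inf)
            continue
        X = np.vstack([members, neg])
        y = np.r_[np.ones(len(members)), np.zeros(len(neg))]
        clf = LinearSVC(C=0.1).fit(X, y)
        detectors.append(clf)
        # 3) Proxy for "representative and discriminative": mean detector
        #    response on held-out videos of the same class.
        scores.append(clf.decision_function(held_out).mean())

    # Keep the top-ranked patch detectors as the video representation.
    top = np.argsort(scores)[::-1][:5]
    print("selected patch clusters:", top)

A fuller treatment would also penalize detectors that fire on patches from other classes; the single-score ranking above is a simplification.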

Cited by 120 publications (91 citation statements)
References: 32 publications
Citation statements (ordered by relevance):
“…The majority of research on attributes focuses on how semantic attributes can better solve a diverse set of computer vision problems [1,4,8,11,15,30] or enable new applications [12,19]. Generally, specifying these semantic attributes and generating suitable datasets from which to learn attribute classifiers is a difficult task that requires considerable effort and domain expertise.…”
Section: Semantic Attributes (mentioning)
confidence: 99%
“…Tang et al [38] proposed a method to automatically annotate discriminative objects in weakly labeled videos. Jain et al [39] represent discriminative video objects at the patch level. Segmentation masks of the extracted objects can be tracked and refined in other frames by the method proposed in [40] and [41].…”
Section: Related Work (mentioning)
confidence: 99%
“…SIFT [24] and Histograms of Oriented Gradients [19]) necessitate optimal alignment between training and testing data and, although they possess strong discriminative power, they fail to take advantage of whole body actions. A recently proposed approach in the domain of computer vision has introduced the notion of mid-level discriminative patches [12] to automatically extract semantically rich spatial or spatiotemporal windows of RGB information, in order to distinguish elements that account for primitive human actions. Various feature extraction techniques have also been proposed in the area of depth maps for human action recognition; typical is the work in [6], where the authors proposed the use of Depth Motion Maps (DMMs) for capturing motion and shape cues concurrently.…”
Section: Related Work (mentioning)
confidence: 99%
“…For training, as before, a leave-one-subject-out protocol was followed, the Mahalanobis distance was used in (12), while the maximum allowed number of sub-clusters per action was two, and highly imbalanced sub-clusters were merged into the same cluster. Table 1 shows results achieved with the proposed method and different combinations of modalities.…”
Section: Huawei/3DLife Dataset (mentioning)
confidence: 99%
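The statement above refers to the citing paper's equation (12), which is not reproduced here; the Mahalanobis distance it names has the standard form d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)). A minimal, self-contained illustration with synthetic sub-cluster statistics follows; all variable names and sizes are illustrative assumptions.

    # Mahalanobis distance of a test feature to one sub-cluster (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    cluster_samples = rng.normal(size=(100, 8))   # synthetic training features of one sub-cluster
    mu = cluster_samples.mean(axis=0)             # sub-cluster mean
    cov_inv = np.linalg.inv(np.cov(cluster_samples, rowvar=False))  # inverse covariance

    def mahalanobis(x, mu, cov_inv):
        """d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))"""
        d = x - mu
        return float(np.sqrt(d @ cov_inv @ d))

    test_feature = rng.normal(size=8)
    print(mahalanobis(test_feature, mu, cov_inv))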