2013 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2013.332

Representing Videos Using Mid-level Discriminative Patches

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches…
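The abstract describes patches selected for being both representative of their class and discriminative against other classes. A minimal sketch of that mining idea, not the authors' implementation, is given below: candidate patch descriptors are clustered, one linear SVM is trained per cluster against patches from other classes, and clusters are ranked by how strongly their detectors fire on held-out videos of the same class. Feature extraction is stubbed out with random vectors, and every function name, constant, and threshold here is an illustrative assumption.

    # Sketch of discriminative patch mining (illustrative only, not the paper's code).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def extract_patch_descriptors(num_patches, dim=256):
        """Stand-in for descriptors of sampled spatio-temporal patches (e.g. HOG3D)."""
        return rng.normal(size=(num_patches, dim))

    pos = extract_patch_descriptors(500)       # patches from videos of the target class
    neg = extract_patch_descriptors(2000)      # patches from all other classes
    held_out = extract_patch_descriptors(300)  # patches from unseen videos of the target class

    # 1) Group visually similar candidate patches.
    clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pos)

    # 2) Train one linear detector per cluster (cluster members vs. the negative pool).
    detectors, scores = [], []
    for k in range(clusters.n_clusters):
        members = pos[clusters.labels_ == k]
        if len(members) < 5:                   # skip clusters too small to train on
            detectors.append(None)
            scores.append(-np.inf)
            continue
        X = np.vstack([members, neg])
        y = np.r_[np.ones(len(members)), np.zeros(len(neg))]
        clf = LinearSVC(C=0.1).fit(X, y)
        detectors.append(clf)
        # 3) Proxy for "representative and discriminative": mean detector
        #    response on held-out videos of the same class.
        scores.append(clf.decision_function(held_out).mean())

    # Keep the top-ranked patch detectors as the video representation.
    top = np.argsort(scores)[::-1][:5]
    print("selected patch clusters:", top)

A fuller treatment would also penalize detectors that fire on patches from other classes; the single-score ranking above is a simplification.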

Cited by 120 publications (91 citation statements)
References: 32 publications
Citation statements (ordered by relevance):
“…The majority of research on attributes focuses on how semantic attributes can better solve a diverse set of computer vision problems [1,4,8,11,15,30] or enable new applications [12,19]. Generally, specifying these semantic attributes and generating suitable datasets from which to learn attribute classifiers is a difficult task that requires considerable effort and domain expertise.…”
Section: Semantic Attributes (mentioning)
confidence: 99%
“…Tang et al [38] proposed a method to automatically annotate discriminative objects in weakly labeled videos. Jain et al [39] represent discriminative video objects at the patch level. Segmentation masks of the extracted objects can be tracked and refined in other frames by the method proposed in [40] and [41].…”
Section: Related Work (mentioning)
confidence: 99%
“…SIFT [24] and Histograms of Oriented Gradients [19]) necessitate optimal alignment between training and testing data and, although they possess strong discriminative power, they fail to take advantage of whole body actions. A recently proposed approach in the domain of computer vision has introduced the notion of mid-level discriminative patches [12] to automatically extract semantically rich spatial or spatiotemporal windows of RGB information, in order to distinguish elements that account for primitive human actions. Various feature extraction techniques have also been proposed in the area of depth maps for human action recognition; typical is the work in [6], where the authors proposed the use of Depth Motion Maps (DMMs) for capturing motion and shape cues concurrently.…”
Section: Related Work (mentioning)
confidence: 99%
“…For training, as before, a leave-one-subject-out protocol was followed, the Mahalanobis distance was used in (12), while the maximum allowed number of sub-clusters per action was two, and highly imbalanced sub-clusters were merged into the same cluster. Table 1 shows results achieved with the proposed method and different combinations of modalities.…”
Section: Huawei/3DLife Dataset (mentioning)
confidence: 99%
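The statement above refers to the citing paper's equation (12), which is not reproduced here; the Mahalanobis distance it names has the standard form d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)). A minimal, self-contained illustration with synthetic sub-cluster statistics follows; all variable names and sizes are illustrative assumptions.

    # Mahalanobis distance of a test feature to one sub-cluster (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    cluster_samples = rng.normal(size=(100, 8))   # synthetic training features of one sub-cluster
    mu = cluster_samples.mean(axis=0)             # sub-cluster mean
    cov_inv = np.linalg.inv(np.cov(cluster_samples, rowvar=False))  # inverse covariance

    def mahalanobis(x, mu, cov_inv):
        """d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))"""
        d = x - mu
        return float(np.sqrt(d @ cov_inv @ d))

    test_feature = rng.normal(size=8)
    print(mahalanobis(test_feature, mu, cov_inv))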