Efficient Action Localization with Approximately Normalized Fisher Vectors

Oneață, Dan; Verbeek, Jakob; Schmid, Cordelia

doi:10.1109/cvpr.2014.326

Cited by 66 publications

(49 citation statements)

References 33 publications

(66 reference statements)


“…We attribute this improvement to the fact that our approach scans the video in a much more efficient way. We obtain a similar performance to Caba Heilbron et al [4] and Oneata et al [25]. This result is encouraging given that our detection pipeline operates at a much faster rate of 134 FPS.…”

Section: Daps For Action Detection (supporting)

confidence: 76%

“…Action Detection: In contrast to object detection methods, the dominant approach for action detection is still to use a sliding window approach [26,18,12] combined with action classifiers trained on multiple features [2,9,33]. Previous approaches have reduced the computational overhead of sliding window search by using branch-and-bound techniques [5,27] and exploiting some characteristics of the visual descriptors. In contrast, our model efficiently reduces the number of evaluated windows by encoding a sequence of visual descriptors.…”

Section: Related Work (mentioning)

confidence: 99%


DAPs: Deep Action Proposals for Action Understanding

Escorcia, Heilbron, Niebles, et al. 2016

Lecture Notes in Computer Science

357

316


Abstract. Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve from untrimmed videos temporal segments, which are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large scale action benchmark, runs at 134 FPS making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e. to retrieve good quality temporal proposals of actions unseen in training.


“…Furthermore, only linear classifier is required by FV, which is a huge advantage in large-scale problems. Therefore, FV and its simplified non-probabilistic version VLAD become more popular in action recognition [3,10,12,21,22,24,34,36].…”

Section: Introduction (mentioning)

confidence: 99%

Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition

Chen, Gong, Wang, et al. 2015

Multimed Tools Appl


We propose a novel approach to model spatio-temporal distribution of local features for action recognition in videos. The proposed approach is based on the Lie Algebrized Gaussians (LAG) which is a feature aggregation approach and yields high-dimensional video signature. In the framework of LAG, local features extracted from a video are aggregated to train a video-specific Gaussian Mixture Model (GMM). Then the video-specific GMM is encoded as a vector based on Lie group theory and this step is also referred to as GMM vectorization. As the video-specific GMM gives a soft partition of the feature space, for each cell of the feature space (i.e. each Gaussian component), we use a GMM to model the spatio-temporal locations of the local features assigned to the Gaussian component. The location GMMs are encoded as vectors just like the local feature GMM. We term those vectors of location GMMs spatio-temporal LAG (STLAG). In addition, although the LAG and the popular Fisher Vector (FV) are derived from distinct theory perspectives, we find that they are closely related. Hence the power and ℓ2 normalization proposed for the FV are also beneficial to the LAG. Experimental results show that STLAG is very effective to model spatio-temporal layout compared with other techniques such as spatio-temporal pyramid and feature augmentation. Using the state-of-the-art dense trajectory features, our approach achieves state-of-the-art performance on two challenging datasets: Hollywood2 and HMDB51.
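For readers unfamiliar with the normalization step this abstract refers to, the following is a minimal sketch of signed power normalization followed by L2 normalization, as commonly applied to Fisher Vectors. The function name `power_l2_normalize` and the default exponent `alpha=0.5` (the signed square root) are assumptions made for illustration; they are not taken from the cited paper.

```python
import math

def power_l2_normalize(v, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    `alpha=0.5` is an assumed default (signed square root), not a value
    stated in the abstract above.
    """
    # Signed power step: sign(x) * |x|**alpha, which dampens large entries.
    powered = [math.copysign(abs(x) ** alpha, x) for x in v]
    # L2 step: rescale so the vector has unit Euclidean norm.
    norm = math.sqrt(sum(x * x for x in powered))
    if norm == 0.0:
        return powered  # all-zero input stays all-zero
    return [x / norm for x in powered]
```

Calling `power_l2_normalize([4.0, -9.0, 0.0])` keeps the sign of each entry and returns a vector of unit length.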


“…Among them, VLAD and FV show outstanding performances for human action recognition [5][24] [25] [27][28] [29]. Compared with BoW in Figure 1, VLAD records the 1st-order difference between local features and codewords, i.e., the residual vectors generated by hard assignment.…”

Section: Introduction (mentioning)

confidence: 99%

A novel hierarchical Bag-of-Words model for compact action representation

Sun, Liu, et al. 2016

Neurocomputing


Bag-of-Words (BoW) histogram of local space-time features is very popular for action representation due to its high compactness and robustness. However, its discriminant ability is limited since it only depends on the occurrence statistics of local features. Alternative models such as Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV) include more information by aggregating high-dimensional residual vectors, but they suffer from the problem of high dimensionality for final representation. To solve this problem, we novelly propose to compress residual vectors into low-dimensional residual histograms by the simple but efficient BoW quantization. To compensate the information loss of this quantization, we iteratively collect higher-order residual vectors to produce high-order residual histograms. Concatenating these histograms yields a hierarchical BoW (HBoW) model which is not only compact but also informative. In experiments, the performances of HBoW are evaluated on four benchmark datasets: HMDB51, Olympic Sports, UCF Youtube and Hollywood2. Experiment results show that HBoW yields much more compact action representation than VLAD and FV, without sacrificing recognition accuracy. Comparisons with state-of-the-art works confirm its superiority further.
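The citation statement for this entry describes VLAD as recording the first-order difference between local features and codewords, with residuals generated by hard assignment. A minimal sketch of that idea follows, assuming Euclidean nearest-codeword assignment and no normalization; the function name `vlad_encode` and the toy inputs are hypothetical, and this is not the HBoW method proposed in the abstract.

```python
def vlad_encode(features, codewords):
    """Sum the residuals (feature - nearest codeword) per codeword,
    then concatenate the per-codeword sums into one vector."""
    dim = len(codewords[0])
    residual_sums = [[0.0] * dim for _ in codewords]
    for f in features:
        # Hard assignment: pick the codeword with the smallest
        # squared Euclidean distance to this feature.
        k = min(
            range(len(codewords)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(f, codewords[i])),
        )
        # Accumulate the residual for the assigned codeword only.
        for d in range(dim):
            residual_sums[k][d] += f[d] - codewords[k][d]
    # Output length is (number of codewords) * (feature dimension).
    return [x for row in residual_sums for x in row]
```

With codewords `[[0, 0], [10, 10]]` and features `[[1, 0], [0, 2], [9, 10]]`, the first two features are assigned to the first codeword and the third to the second. The output length grows with both the codebook size and the feature dimension, which is the dimensionality problem the abstract points to.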
