Evaluating multimedia features and fusion for example-based event detection

Myers, Gregory; Nallapati, Ramesh; Hout, Julien van; Pancoast, Stephanie; Nevatia, Ramakant; Sun, Chen; Habibian, Amirhossein; Koelma, D.C.; Sande, Koen E. A. van de; Smeulders, A.W.M.; Snoek, Cees G. M.

doi:10.1007/s00138-013-0527-8

Cited by 29 publications

(14 citation statements)

References 34 publications

(27 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Naturally, multiple works have investigated the fusion of information from different modalities [10,14,16]. In this work, we also investigate the effect of fusing our deep representations with Motion Boundary Histogram (motion) features [24] and MFCC (audio) features [14], both of which are encoded into a video representation using Fisher Vectors [19]. This fusion allows us to compare the effectiveness of our deep representations to heterogeneous representations and to investigate how well our deep representations fare when combined with other sources of information.…”

Section: Event Detection With Pre-trained Networkmentioning

confidence: 99%

The ImageNet Shuffle

Mettes

Koelma

Snoek

2016

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval

Self Cite

View full text Add to dashboard Cite

This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, who all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pretrained with our reorganized hierarchy i) improves over standard pre-training, ii) is complementary among different reorganizations, iii) maintains the benefits of fusion with other modalities, and iv) leads to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/ imagenetshuffle.

show abstract

Section: Event Detection With Pre-trained Networkmentioning

confidence: 99%

The ImageNet Shuffle

Mettes

Koelma

Snoek

2016

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval

Self Cite

View full text Add to dashboard Cite

show abstract

“…Many previous approaches have considered one particular observable property, such as trajectories [13], motion features [4], or group interactions [7]. Multimodal fusion has been an active research topic, in particular to detect specific video concepts [14]. Our approach will also comprise various observables, which can be derived from the visual source of a camera feed as this is the most commonly available sensor in surveillance.…”

Section: Related Workmentioning

confidence: 99%

“…Further, multi-level approaches decrease the semantic gap between low-level features and complex behaviours [8]. An intermediate-level representation with dedicated components for the complex behaviours of interest has been successful for highly semantic phenomena such as the TRECVID MED competition [14,16]. We follow this approach for threat detection and design a dedicated intermediate-level representation that captures semantic observables related to threats.…”

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

Instantaneous threat detection based on a semantic representation of activities, zones and trajectories

Burghouts¹,

Schutte²,

Hove³

et al. 2014

SIViP

View full text Add to dashboard Cite

Threat detection is a challenging problem, because threats appear in many variations and differences to normal behaviour can be very subtle. In this paper, we consider threats on a parking lot, where theft of a truck's cargo occurs. The theft takes place in very different forms, in the midst of many people who pose no threat. The threats range from explicit, e.g., a person attacking the truck driver, to implicit, e.g., somebody loitering and then fiddling with the exterior of the truck in order to open it. Our goal is a system that is able to recognize a threat instantaneously as they develop. Typical observables of the threats are a person's activity, presence in a particular zone, and the trajectory. The novelty of this paper is an encoding of these threat observables in a semantic, intermediate-level representation, based on low-level visual features that have no intrinsic semantic meaning themselves. The semantic representation encodes the notions of trajectories, zones and activities. The aim of this representation is to bridge the semantic gap between the low-level tracks and motion and the higher-level notion of threats. In our experiments, we demonstrate that our semantic representation is more descriptive for threat detection than directly using low-level features. We find that a person's activities are the most important elements of this semantic representation, followed by the person's trajectory. The proposed threat detection system is very accurate: 96.6% of the tracks are correctly interpreted, when considering the temporal context.

show abstract

“…36,37 Wang et al 34 integrated multiple descriptors into a new descriptor for subsequent processes of the BoVW framework using a simple strategy for feature weighting. Jain et al 35 presented an innovative motion descriptor named divergence-curl-shear (DCS), where a linear combination of kernel matrices belonging to each local descriptor is concatenated directly by the method of kernel average and then fed into the linear SVM.…”

Section: Introductionmentioning

confidence: 99%

Weighted score-level feature fusion based on Dempster–Shafer evidence theory for action recognition

Zhang

Jia

et al. 2018

J. Electron. Imag.

View full text Add to dashboard Cite

, "Weighted score-level feature fusion based on Dempster-Shafer evidence theory for action recognition," J. Electron. Imaging 27(1), 013021 (2018), doi: 10.1117/1.JEI.27.1.013021. Abstract. The majority of human action recognition methods use multifeature fusion strategy to improve the classification performance, where the contribution of different features for specific action has not been paid enough attention. We present an extendible and universal weighted score-level feature fusion method using the Dempster-Shafer (DS) evidence theory based on the pipeline of bag-of-visual-words. First, the partially distinctive samples in the training set are selected to construct the validation set. Then, local spatiotemporal features and pose features are extracted from these samples to obtain evidence information. The DS evidence theory and the proposed rule of survival of the fittest are employed to achieve evidence combination and calculate optimal weight vectors of every feature type belonging to each action class. Finally, the recognition results are deduced via the weighted summation strategy. The performance of the established recognition framework is evaluated on Penn Action dataset and a subset of the joint-annotated human metabolome database (sub-JHMDB). The experiment results demonstrate that the proposed feature fusion method can adequately exploit the complementarity among multiple features and improve upon most of the state-of-the-art algorithms on Penn Action and sub-JHMDB datasets. © The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.

show abstract

Evaluating multimedia features and fusion for example-based event detection

Cited by 29 publications

References 34 publications

The ImageNet Shuffle

The ImageNet Shuffle

Instantaneous threat detection based on a semantic representation of activities, zones and trajectories

Weighted score-level feature fusion based on Dempster–Shafer evidence theory for action recognition

Contact Info

Product

Resources

About