2013
DOI: 10.1007/s00138-013-0527-8

Evaluating multimedia features and fusion for example-based event detection

Abstract: Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; highl…
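The abstract describes combining several single-modality bag-of-words classifiers. As a rough illustration of the late-fusion idea, here is a minimal sketch; the modality names, codebook sizes, and random stand-in data are assumptions for illustration, not SESAME's actual configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Minimal late-fusion sketch: one bag-of-words classifier per modality,
# with per-modality event probabilities averaged into a final score.
# Modality names, codebook sizes, and data are hypothetical placeholders.
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
codebook_sizes = {"visual": 1000, "motion": 4000, "audio": 512}
y_train = rng.integers(0, 2, size=n_train)  # event present / absent

fused = np.zeros(n_test)
for modality, dim in codebook_sizes.items():
    X_train = rng.random((n_train, dim))  # stand-in BoW histograms
    X_test = rng.random((n_test, dim))
    clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)
    # Late fusion: average the per-modality event probabilities.
    fused += clf.predict_proba(X_test)[:, 1] / len(codebook_sizes)

print(fused[:5])  # fused event-detection scores for the first test clips
```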

Cited by 29 publications (14 citation statements) · References 34 publications (27 reference statements)
“…Naturally, multiple works have investigated the fusion of information from different modalities [10,14,16]. In this work, we also investigate the effect of fusing our deep representations with Motion Boundary Histogram (motion) features [24] and MFCC (audio) features [14], both of which are encoded into a video representation using Fisher Vectors [19]. This fusion allows us to compare the effectiveness of our deep representations to heterogeneous representations and to investigate how well our deep representations fare when combined with other sources of information.…”
Section: Event Detection With Pre-trained Network
confidence: 99%
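The statement above describes encoding MBH motion and MFCC audio descriptors into Fisher Vectors and fusing them with deep representations. A common fusion scheme, assumed here for illustration rather than taken from [14] or [19], is to apply the standard power- and L2-normalization to each representation and concatenate:

```python
import numpy as np

def normalize(v):
    """Signed square-rooting followed by L2 normalization, the usual
    post-processing applied to Fisher Vectors."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Hypothetical pre-computed encodings for one video clip; the dimensions
# are placeholders, not the values used in the cited papers.
rng = np.random.default_rng(0)
deep = rng.standard_normal(4096)             # deep (CNN) representation
mbh_fv = rng.standard_normal(2 * 256 * 64)   # Fisher Vector over MBH
mfcc_fv = rng.standard_normal(2 * 256 * 39)  # Fisher Vector over MFCC

# Early fusion: normalize each representation, then concatenate into a
# single video-level vector suitable for a linear classifier.
fused = np.concatenate([normalize(v) for v in (deep, mbh_fv, mfcc_fv)])
print(fused.shape)
```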
“…Many previous approaches have considered one particular observable property, such as trajectories [13], motion features [4], or group interactions [7]. Multimodal fusion has been an active research topic, in particular to detect specific video concepts [14]. Our approach will also comprise various observables, which can be derived from the visual source of a camera feed as this is the most commonly available sensor in surveillance.…”
Section: Related Work
confidence: 99%
“…Further, multi-level approaches decrease the semantic gap between low-level features and complex behaviours [8]. An intermediate-level representation with dedicated components for the complex behaviours of interest has proven successful for highly semantic tasks such as the TRECVID MED competition [14,16]. We follow this approach for threat detection and design a dedicated intermediate-level representation that captures semantic observables related to threats.…”
Section: Related Work
confidence: 99%
“…36,37 Wang et al. 34 integrated multiple descriptors into a single new descriptor for the subsequent stages of the BoVW framework, using a simple feature-weighting strategy. Jain et al. 35 presented a motion descriptor named divergence-curl-shear (DCS), in which the kernel matrices computed for the individual local descriptors are combined by kernel averaging and the resulting kernel is fed into a linear SVM.…”
Section: Introduction
confidence: 99%
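The kernel-average fusion mentioned in the statement above can be sketched as follows; the descriptor channels, kernel choice, and data are illustrative assumptions, not the exact setup of Jain et al.:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Kernel-average fusion sketch: one kernel matrix per descriptor channel,
# averaged into a single kernel for an SVM with a precomputed kernel.
# Channel dimensions and data are hypothetical placeholders.
rng = np.random.default_rng(0)
n = 120
channels = [rng.random((n, d)) for d in (96, 108, 192)]  # e.g. HOG/HOF/MBH
y = rng.integers(0, 2, size=n)

# Average the per-channel kernel matrices into one fused kernel.
K = np.mean([rbf_kernel(X) for X in channels], axis=0)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))  # training accuracy on the fused kernel
```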