2012
DOI: 10.1007/978-3-642-27355-1_7
Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning

Cited by 13 publications (13 citation statements). References 12 publications.
“…Finally, the results show that CAT and MKL react similarly when reducing the size of the training set. This is in disagreement with the literature (see [3]). Notice, however, that the experimental conditions are not the same.…”
Section: Results (contrasting)
confidence: 97%
“…It is also worth noticing that the MKL and the CAT methods are comparable and both perform better than the CWS approach. This last statement is in disagreement with [3], where MKL outperforms CAT. The experimental conditions, however, are not the same.…”
Section: Results (mentioning)
confidence: 79%
“…In early fusion the audio and the visual features are combined before classification [12], while in late fusion the classification scores from the individual feature models are combined [9,11,16]. Kernel fusion can be considered an intermediate fusion: the audio and the visual features are merged at the kernel level before performing the classification [1,17]. These methods fuse the audio and visual modalities without considering their correlations.…”
Section: Related Work (mentioning)
confidence: 99%
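The citation above contrasts early, late, and kernel-level fusion of audio and visual features. The sketch below is a minimal illustration of those three strategies under stated assumptions, not the cited paper's implementation: the feature matrices X_audio and X_visual, the labels y, and the fixed 0.5 kernel weights are hypothetical (a full multiple kernel learning solver would learn the kernel weights rather than fix them).

```python
# Minimal sketch of early, late, and kernel-level (intermediate) fusion of
# audio and visual features. All data below is synthetic and the kernel
# weights are fixed; this is an assumption-laden illustration, not the
# paper's method.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
X_audio = rng.random((100, 64))    # hypothetical audio bag-of-words histograms
X_visual = rng.random((100, 256))  # hypothetical visual bag-of-words histograms
y = rng.integers(0, 2, 100)        # hypothetical binary concept labels

# Early fusion: concatenate the modality features before classification.
early_clf = SVC(kernel="rbf").fit(np.hstack([X_audio, X_visual]), y)

# Late fusion: train one classifier per modality, then combine their scores.
clf_a = SVC(kernel="rbf", probability=True).fit(X_audio, y)
clf_v = SVC(kernel="rbf", probability=True).fit(X_visual, y)
late_scores = 0.5 * clf_a.predict_proba(X_audio)[:, 1] \
            + 0.5 * clf_v.predict_proba(X_visual)[:, 1]

# Kernel (intermediate) fusion: build one kernel per modality, combine the
# kernels, then classify on the fused kernel. An MKL solver would learn the
# weights; here they are simply fixed at 0.5 each.
K_audio = chi2_kernel(X_audio)
K_visual = chi2_kernel(X_visual)
K_fused = 0.5 * K_audio + 0.5 * K_visual
kernel_clf = SVC(kernel="precomputed").fit(K_fused, y)
print(kernel_clf.score(K_fused, y))
```

In the setting of the indexed paper the audio features would be bag-of-auditory-words histograms and the visual features visual bag-of-words histograms; the χ² kernel is a common choice for such histograms but is used here only as an example of a per-modality kernel.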