2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
DOI: 10.1109/cvprw.2009.5204264
Audiovisual event detection towards scene understanding

Abstract: Acoustic events produced in meeting environments may contain useful information for perceptually aware interfaces and multimodal behavior analysis. In this paper, a system to detect and recognize these events from a multimodal perspective is presented, combining information from multiple cameras and microphones. First, spectral and temporal features are extracted from a single audio channel and spatial localization is achieved by exploiting cross-correlation among microphone arrays. Second, several video cues o…
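The abstract's mention of spatial localization via cross-correlation among microphone arrays can be illustrated with a time-delay-of-arrival estimate between a microphone pair. The sketch below assumes a GCC-PHAT formulation, a common choice for this step; the paper's actual localization pipeline (array geometry, aggregation across pairs) is not described here.

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the time delay of arrival between two microphone signals
    with generalized cross-correlation and phase transform (GCC-PHAT).
    Illustrative only; the paper's localization method may differ."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / float(fs)

# Toy check: white noise delayed by 10 samples at 16 kHz.
rng = np.random.default_rng(0)
mic_b = rng.standard_normal(1600)
mic_a = np.roll(mic_b, 10)                    # mic_a hears the source 10 samples later
print(gcc_phat_tdoa(mic_a, mic_b, fs=16000))  # ~0.000625 s
```

In a real array, delays from several microphone pairs would be combined with the known geometry to produce a position estimate.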

Cited by 10 publications (8 citation statements). References 11 publications.

“…This has worked fairly well with DNNs in the past, and while the convolution step filters the signal, pooling is a more drastic step that reduces the detail in the data. The results suggest that this reduction in detail is not so important when pooling is along time, which can be observed by the similar performance for 1 × 1 and 1 × 2, and 2 × 1 and 2 × 2, in Fig.…”
Section: Pooling (supporting)
confidence: 52%
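The quoted passage compares pooling kernels of shape 1 × 1, 1 × 2, 2 × 1 and 2 × 2 on time-frequency input. A minimal PyTorch sketch of how each shape changes the feature-map resolution; treating the first kernel dimension as frequency and the second as time is an assumption made here for illustration, not the cited paper's stated convention.

```python
import torch
import torch.nn as nn

# Dummy batch of log-mel spectrograms: (batch, channels, frequency, time).
x = torch.randn(8, 1, 40, 100)
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)

for kernel in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    pooled = nn.MaxPool2d(kernel_size=kernel)(conv(x))
    print(kernel, tuple(pooled.shape))
# (1, 1) keeps full resolution, (1, 2) halves the time axis,
# (2, 1) halves the frequency axis, (2, 2) halves both.
```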
“…The field has attracted increasing attention in recent years including dedicated challenges such as CLEAR [3], and recently D-CASE [4], with tasks involving the detection of a known set of acoustic events happening in a smart room or office setting. In addition, AED applications range from rich transcription in speech communication [3,4] and scene understanding [5,6], to being a source of information for informed speech enhancement and ASR. Gaining access to richer acoustic event classifiers could effectively support speech detection and informed speech enhancement [2] by providing the system with details about what kind of noise surrounds the speakers, besides the obvious benefits of richer transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
“…This idea was first presented in [11], where the detection of footsteps was improved by exploiting the velocity information obtained from a videobased person-tracking system. Further improvement was shown in our previous papers [12,13], where the concept of multimodal AED is extended to detect and recognize the set of 11 AEs. In that work, not only video information but also acoustic source localization information was considered.…”
Section: Introduction (mentioning)
confidence: 95%
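As a concrete illustration of the multimodal idea in this quote, the sketch below fuses audio features with a tracker-derived speed value by simple per-frame concatenation before classification. The feature names, the toy data, and the fusion-by-concatenation choice are assumptions for the example, not the cited systems' actual design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-frame data: spectro-temporal audio descriptors plus the speed of
# the tracked person obtained from a video-based tracker (hypothetical).
rng = np.random.default_rng(0)
n_frames = 500
audio_feats = rng.standard_normal((n_frames, 12))      # e.g. MFCC-like features
speed = rng.uniform(0.0, 2.0, size=(n_frames, 1))       # metres per second
is_footstep = (speed[:, 0] > 1.0).astype(int)           # toy target for the demo

# Feature-level fusion: concatenate the audio and video cues for each frame.
fused = np.hstack([audio_feats, speed])
clf = LogisticRegression(max_iter=1000).fit(fused, is_footstep)
print("training accuracy:", clf.score(fused, is_footstep))
```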
“…The goal is to process a continuous acoustic signal and convert it into a sequence of event labels with associated start and end times. This has direct applications to rich transcription in speech communication [1,2] and scene understanding [3,4], but also as a source of information for informed speech enhancement and automatic speech recognition (ASR) systems. State-of-the-art hands-free meeting analysis systems already include simple event detection components in order to differentiate speech from laughter [5], but gaining access to richer acoustic event classifiers could effectively support speech detection and informed speech enhancement [6] by providing the system with details on what kind of noise surrounds the speakers, besides the obvious benefits of richer transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
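The task described in this quote, turning a continuous signal into event labels with start and end times, typically ends with a step that collapses per-frame decisions into segments. A minimal sketch of that last step, assuming frame-level labels are already available; the smoothing and thresholding that real systems apply are omitted.

```python
def frames_to_events(frame_labels, hop_seconds, background="silence"):
    """Collapse per-frame class decisions into (label, start, end) tuples.
    Frames labelled `background` are not reported as events."""
    events = []
    current, start = None, 0.0
    for i, label in enumerate(list(frame_labels) + [None]):  # sentinel closes last run
        if label != current:
            if current is not None and current != background:
                events.append((current, start, i * hop_seconds))
            current, start = label, i * hop_seconds
    return events

frames = ["silence"] * 10 + ["applause"] * 25 + ["silence"] * 5 + ["speech"] * 20
print(frames_to_events(frames, hop_seconds=0.02))
# [('applause', 0.2, 0.7), ('speech', 0.8, 1.2)], up to float rounding
```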