2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
DOI: 10.1109/cvprw.2009.5204264
Audiovisual event detection towards scene understanding

Abstract: Acoustic events produced in meeting environments may contain useful information for perceptually aware interfaces and multimodal behavior analysis. In this paper, a system to detect and recognize these events from a multimodal perspective is presented, combining information from multiple cameras and microphones. First, spectral and temporal features are extracted from a single audio channel and spatial localization is achieved by exploiting cross-correlation among microphone arrays. Second, several video cues o…
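The abstract's mention of spatial localization via cross-correlation among microphone arrays can be illustrated with a time-delay-of-arrival estimate between a microphone pair. The sketch below assumes a GCC-PHAT formulation, a common choice for this step; the paper's actual localization pipeline (array geometry, aggregation across pairs) is not described here.

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the time delay of arrival between two microphone signals
    with generalized cross-correlation and phase transform (GCC-PHAT).
    Illustrative only; the paper's localization method may differ."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / float(fs)

# Toy check: white noise delayed by 10 samples at 16 kHz.
rng = np.random.default_rng(0)
mic_b = rng.standard_normal(1600)
mic_a = np.roll(mic_b, 10)                    # mic_a hears the source 10 samples later
print(gcc_phat_tdoa(mic_a, mic_b, fs=16000))  # ~0.000625 s
```

In a real array, delays from several microphone pairs would be combined with the known geometry to produce a position estimate.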

Cited by 10 publications (8 citation statements). References 11 publications.

“…This has worked fairly well with DNNs in the past, and while the convolution step filters the signal, pooling is a more drastic step that reduces the detail in the data. The results suggest that this reduction in detail is not so important when pooling is along time, which can be observed by the similar performance for 1 × 1 and 1 × 2, and 2 × 1 and 2 × 2, in Fig.…”
Section: Pooling (supporting)
confidence: 52%
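The quoted passage compares pooling kernels of shape 1 × 1, 1 × 2, 2 × 1 and 2 × 2 on time-frequency input. A minimal PyTorch sketch of how each shape changes the feature-map resolution; treating the first kernel dimension as frequency and the second as time is an assumption made here for illustration, not the cited paper's stated convention.

```python
import torch
import torch.nn as nn

# Dummy batch of log-mel spectrograms: (batch, channels, frequency, time).
x = torch.randn(8, 1, 40, 100)
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)

for kernel in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    pooled = nn.MaxPool2d(kernel_size=kernel)(conv(x))
    print(kernel, tuple(pooled.shape))
# (1, 1) keeps full resolution, (1, 2) halves the time axis,
# (2, 1) halves the frequency axis, (2, 2) halves both.
```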
“…The field has attracted increasing attention in recent years including dedicated challenges such as CLEAR [3], and recently D-CASE [4], with tasks involving the detection of a known set of acoustic events happening in a smart room or office setting. In addition, AED applications range from rich transcription in speech communication [3,4] and scene understanding [5,6], to being a source of information for informed speech enhancement and ASR. Gaining access to richer acoustic event classifiers could effectively support speech detection and informed speech enhancement [2] by providing the system with details about what kind of noise surrounds the speakers, besides the obvious benefits of richer transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
“…This idea was first presented in [11], where the detection of footsteps was improved by exploiting the velocity information obtained from a videobased person-tracking system. Further improvement was shown in our previous papers [12,13], where the concept of multimodal AED is extended to detect and recognize the set of 11 AEs. In that work, not only video information but also acoustic source localization information was considered.…”
Section: Introduction (mentioning)
confidence: 95%
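As a concrete illustration of the multimodal idea in this quote, the sketch below fuses audio features with a tracker-derived speed value by simple per-frame concatenation before classification. The feature names, the toy data, and the fusion-by-concatenation choice are assumptions for the example, not the cited systems' actual design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-frame data: spectro-temporal audio descriptors plus the speed of
# the tracked person obtained from a video-based tracker (hypothetical).
rng = np.random.default_rng(0)
n_frames = 500
audio_feats = rng.standard_normal((n_frames, 12))      # e.g. MFCC-like features
speed = rng.uniform(0.0, 2.0, size=(n_frames, 1))       # metres per second
is_footstep = (speed[:, 0] > 1.0).astype(int)           # toy target for the demo

# Feature-level fusion: concatenate the audio and video cues for each frame.
fused = np.hstack([audio_feats, speed])
clf = LogisticRegression(max_iter=1000).fit(fused, is_footstep)
print("training accuracy:", clf.score(fused, is_footstep))
```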
“…The goal is to process a continuous acoustic signal and convert it into a sequence of event labels with associated start and end times. This has direct applications to rich transcription in speech communication [1,2] and scene understanding [3,4], but also as a source of information for informed speech enhancement and automatic speech recognition (ASR) systems. State-of-the-art hands-free meeting analysis systems already include simple event detection components in order to differentiate speech from laughter [5], but gaining access to richer acoustic event classifiers could effectively support speech detection and informed speech enhancement [6] by providing the system with details on what kind of noise surrounds the speakers, besides the obvious benefits of richer transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
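The task described in this quote, turning a continuous signal into event labels with start and end times, typically ends with a step that collapses per-frame decisions into segments. A minimal sketch of that last step, assuming frame-level labels are already available; the smoothing and thresholding that real systems apply are omitted.

```python
def frames_to_events(frame_labels, hop_seconds, background="silence"):
    """Collapse per-frame class decisions into (label, start, end) tuples.
    Frames labelled `background` are not reported as events."""
    events = []
    current, start = None, 0.0
    for i, label in enumerate(list(frame_labels) + [None]):  # sentinel closes last run
        if label != current:
            if current is not None and current != background:
                events.append((current, start, i * hop_seconds))
            current, start = label, i * hop_seconds
    return events

frames = ["silence"] * 10 + ["applause"] * 25 + ["silence"] * 5 + ["speech"] * 20
print(frames_to_events(frames, hop_seconds=0.02))
# [('applause', 0.2, 0.7), ('speech', 0.8, 1.2)], up to float rounding
```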