Look who's talking: speaker detection using video and audio correlation

Cutler, Ross; Davis, Larry S.

doi:10.1109/icme.2000.871073

Cited by 108 publications

(94 citation statements)

References 17 publications


“…Broadly speaking, the differences among existing works arise from the overall goal (tracking single vs. multiple speakers), the specific detection/tracking framework, and the AV sensor configuration. Much work has concentrated on the single-speaker case, assuming either single-person scenes [7,34,1], or multiperson scenes where only the location of the current speaker needs to be tracked [36,17,13,43,48,3]. Many of these works used simple sensor configurations (e.g.…”

Section: Related Work (mentioning)

confidence: 99%

“…Localizing and tracking speakers in enclosed spaces using AV information has increasingly attracted attention in signal processing and computer vision [36,17,7,34,13,43,48,1,3,6,5], given the complementary characteristics of each modality. Broadly speaking, the differences among existing works arise from the overall goal (tracking single vs. multiple speakers), the specific detection/tracking framework, and the AV sensor configuration.…”

Section: Related Work (mentioning)

confidence: 99%


Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings

Gática-Pérez, Lathoud, Odobez, et al. 2007

IEEE Trans. Audio Speech Lang. Process.

Abstract. Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room, equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach. IDIAP-RR 05-27


“…From the literature, several machine learning approaches are known that can be employed to perform this kind of sensor data fusion. For example, in [6] a time-delayed neural network (TDNN) is applied in an automatic lipreading system to fuse audio and visual data. In [11], another TDNN is applied to visual and audio data to detect when and where a person is speaking in a scene.…”

Section: Related Work (mentioning)

confidence: 99%

ART-Based Fusion of Multi-modal Information for Mobile Robots

Berghöfer, Schulze, Tscherepanow, et al. 2011

Engineering Applications of Neural Networks

Abstract. Robots operating in complex environments shared with humans are confronted with numerous problems. One important problem is the identification of obstacles and interaction partners. In order to reach this goal, it can be beneficial to use data from multiple available sources, which need to be processed appropriately. Furthermore, such environments are not static. Therefore, the robot needs to learn novel objects. In this paper, we propose a method for learning and identifying obstacles based on multi-modal information. As this approach is based on Adaptive Resonance Theory networks, it is inherently capable of incremental online learning.


“…The effectiveness of fusing video and audio features for tracking was demonstrated in [1], [2], [3]. The success of the fusion strategy is mainly because each modality may compensate for the weaknesses of the other or can provide additional information ([4], [5]). For example, a speaker identified via audio detection may trigger the camera zooming in a teleconference.…”

Section: Introduction (mentioning)

confidence: 99%

Target Detection and Tracking With Heterogeneous Sensors

Zhou, Taj, Cavallaro 2008

IEEE J. Sel. Top. Signal Process.

Abstract. We present a multimodal detection and tracking algorithm for sensors composed of a camera mounted between two microphones. Target localization is performed on color-based change detection in the video modality and on Time Difference of Arrival (TDOA) estimation between the two microphones in the audio modality. The TDOA is computed by multi-band Generalized Cross Correlation (GCC) analysis. The estimated directions of arrival are then post-processed using a Riccati Kalman filter. The visual and audio estimates are finally integrated, at the likelihood level, into a particle filter (PF) that uses a zero-order motion model, and a Weighted Probabilistic Data Association (WPDA) scheme. We demonstrate that the Kalman filtering (KF) improves the accuracy of the audio source localization and that the WPDA helps to enhance the tracking performance of sensor fusion in reverberant scenarios. The combination of multi-band GCC, KF and WPDA within the particle filtering framework improves the performance of the algorithm in noisy scenarios. We also show how the proposed audiovisual tracker summarizes the observed scene by generating metadata that can be transmitted to other network nodes instead of transmitting the raw images and can be used for very low bit rate communication. Moreover, the generated metadata can also be used to detect and monitor events of interest.

Index Terms: Multimodal detection and tracking, Kalman filter, particle filter, heterogeneous sensors, low bit rate communication.
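As a rough illustration of the TDOA step mentioned in this abstract, the sketch below estimates the delay between two microphone signals with a plain time-domain cross-correlation and converts it to a direction of arrival. This is not the authors' method: it uses unweighted, single-band correlation instead of multi-band GCC, and it assumes a far-field source and a speed of sound of 343 m/s. The function names `estimate_tdoa` and `direction_of_arrival` and all parameters are hypothetical.

```python
import math


def estimate_tdoa(left, right, sample_rate, max_lag):
    """Estimate the delay (in seconds) of `right` relative to `left`.

    Plain cross-correlation over integer lags in [-max_lag, max_lag];
    an assumption-laden stand-in for multi-band GCC.
    A positive result means the sound reached the left microphone first.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        # Keep the lag with the highest correlation score.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def direction_of_arrival(tdoa, mic_distance, speed_of_sound=343.0):
    """Convert a TDOA to an angle in degrees (far-field assumption).

    0 degrees is straight ahead; positive angles point toward the
    left microphone. The ratio is clamped so noisy delays that exceed
    the physical maximum do not raise a math domain error.
    """
    ratio = speed_of_sound * tdoa / mic_distance
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

For example, calling `estimate_tdoa(left, right, 16000, 8)` on two equal-length sample lists and passing the result to `direction_of_arrival` with the microphone spacing in metres gives a coarse bearing, which a tracker could then smooth over time.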
