2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia
DOI: 10.1109/icme.2000.871073

Look who's talking: speaker detection using video and audio correlation

Abstract: The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speechreading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a time delayed neural network, which is then used to perform a spatio-temporal search for a speaking person. Appl…
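The abstract only outlines the approach, so the following is a minimal sketch (in PyTorch, not the authors' code) of how a time-delayed neural network could score audio-visual correlation and drive a spatial search for the speaker. The feature choices (one audio track plus one visual motion track per candidate region), the layer sizes, and the context width are all illustrative assumptions.

```python
# Minimal sketch of the idea in the abstract: learn audio-visual
# correlation with a time-delayed neural network (TDNN), then score
# candidate image regions against the audio to find the speaker.
# All hyperparameters below are illustrative, not from the paper.
import torch
import torch.nn as nn

class AVCorrelationTDNN(nn.Module):
    """Scores how well a short audio feature window matches a visual one.

    The Conv1d layers act as time-delay units: each output frame sees a
    context of neighboring frames, which is the defining TDNN property.
    """
    def __init__(self, context: int = 5):
        super().__init__()
        # two input channels: audio feature track and visual feature track
        self.tdnn = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=context),   # first delay layer
            nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=context),  # second delay layer
            nn.ReLU(),
        )
        self.head = nn.Linear(16, 1)  # speaking / not-speaking score

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        # audio_feat, visual_feat: (batch, frames), time-aligned;
        # frames must exceed 2 * (context - 1) for the two conv layers
        x = torch.stack([audio_feat, visual_feat], dim=1)  # (B, 2, T)
        h = self.tdnn(x).mean(dim=2)                       # pool over time
        return torch.sigmoid(self.head(h)).squeeze(1)      # (B,) in [0, 1]

def spatial_search(model, audio_feat, region_feats):
    """Spatio-temporal search: score each candidate image region's visual
    track against the same audio track; return the best region index."""
    scores = [model(audio_feat[None], r[None]).item() for r in region_feats]
    return max(range(len(scores)), key=scores.__getitem__)
```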

Cited by 108 publications (94 citation statements) · References 17 publications
“…Broadly speaking, the differences among existing works arise from the overall goal (tracking single vs. multiple speakers), the specific detection/tracking framework, and the AV sensor configuration. Much work has concentrated on the single-speaker case, assuming either single-person scenes [7,34,1], or multiperson scenes where only the location of the current speaker needs to be tracked [36,17,13,43,48,3]. Many of these works used simple sensor configurations (e.g.…”
Section: Related Work (mentioning)
confidence: 99%
“…Localizing and tracking speakers in enclosed spaces using AV information has increasingly attracted attention in signal processing and computer vision [36,17,7,34,13,43,48,1,3,6,5], given the complementary characteristics of each modality. Broadly speaking, the differences among existing works arise from the overall goal (tracking single vs. multiple speakers), the specific detection/tracking framework, and the AV sensor configuration.…”
Section: Related Work (mentioning)
confidence: 99%
“…From the literature, several machine learning approaches are known that can be employed to perform this kind of sensor data fusion. For example, in [6] a time-delayed neural network (TDNN) is applied in an automatic lipreading system to fuse audio and visual data. In [11], another TDNN is applied to visual and audio data to detect when and where a person is speaking in a scene.…”
Section: Related Work (mentioning)
confidence: 99%
“…The effectiveness of fusing video and audio features for tracking was demonstrated in [1], [2], [3]. The success of the fusion strategy is mainly because each modality may compensate for the weaknesses of the other or can provide additional information ( [4], [5]). For example, a speaker identified via audio detection may trigger the camera zooming in a teleconference.…”
Section: Introduction (mentioning)
confidence: 99%
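As a toy illustration of the fusion rationale in the statement above (each modality compensating for the other's weaknesses), here is a hedged sketch of late score fusion; the weighting scheme and score ranges are assumptions, not taken from any of the cited works.

```python
# Illustrative late-fusion sketch (not from any cited paper): combine
# per-region audio and video speaker scores so one modality can
# compensate when the other is unreliable.
import numpy as np

def fuse_scores(audio_scores: np.ndarray,
                video_scores: np.ndarray,
                audio_conf: float = 0.5) -> int:
    """Return the index of the most likely speaker region.

    audio_scores / video_scores: one score per candidate region in [0, 1].
    audio_conf: weight on audio, e.g. lowered when the room is noisy so
    that visual evidence dominates (and raised in poor lighting).
    """
    fused = audio_conf * audio_scores + (1.0 - audio_conf) * video_scores
    return int(np.argmax(fused))

# Example: audio localization is ambiguous, video is confident.
audio = np.array([0.40, 0.45, 0.40])
video = np.array([0.10, 0.90, 0.20])
print(fuse_scores(audio, video, audio_conf=0.3))  # -> 1
```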