Proceedings of the 2015 ACM on International Conference on Multimodal Interaction 2015
DOI: 10.1145/2818346.2820780
Who's Speaking?

Abstract: Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates that active speakers can also be identified from spatio-temporal video features that capture other cues: movement of the head, upper body and hands. Speaker directional information, obtained via sound source localization from a microphone array, is used to supervise the training of these video features.
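The supervision signal described in the abstract comes from the audio side: a microphone array localizes the active speaker, and that direction serves as a weak label for training the visual features. A minimal sketch of one common way to obtain such a directional signal, assuming a two-microphone array and GCC-PHAT delay estimation (the function names and the left/right boundary below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref`
    using the GCC-PHAT cross-correlation."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre the circular correlation around zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def weak_speaker_label(tau, boundary_tau=0.0):
    """Crude directional label from the inter-microphone delay.
    The left/right mapping is an assumption for illustration."""
    return "left" if tau > boundary_tau else "right"

# Synthetic check: a source whose signal reaches the second
# microphone 25 samples later at a 16 kHz sampling rate.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
d = 25
sig = np.concatenate((np.zeros(d), ref[:-d]))
tau = gcc_phat(sig, ref, fs, max_tau=0.005)
print(tau)                  # ~ d / fs = 0.0015625 s
print(weak_speaker_label(tau))
```

In the paper's setting, per-frame direction estimates of this kind would be aligned with people detected in the video to generate positive and negative training labels for the visual classifier.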

Cited by 28 publications (6 citation statements)
References 12 publications
“…They proposed a time-delayed neural network to learn the audiovisual correlations from speech activity. P. Chakravarty [30] re-explored leveraging audio as supervision using rich alignment between audio and visual information. Following this, [13], [17], [32] proposed a model that jointly trains an audiovisual embedding that enables more accurate active speaker detection.…”
Section: Related Work
confidence: 99%
“…Active Speaker Detection. Works on ASD have evolved from facial visual cues [21,34,42] to audio as primary source [6,18], to multi-modal data combination [3,4,29,40,46]. Since the introduction of AVA-ActiveSpeaker [40], combining audio with facial features is the de facto way to predict active speakers.…”
Section: Related Work
confidence: 99%
“…Other methods approach the task of ASL, which seeks to localize speakers spatially within the scene rather than classifying bounding box tracks [7,16,24,37,38,87,104,106]. Several use multichannel audio to incorporate directional audio information [7,16,24,37,38,104,106]. Recently,…”
Section: Related Work
confidence: 99%