2016
DOI: 10.1007/978-3-319-46454-1_18
Cross-Modal Supervision for Learning Active Speaker Detection in Video

Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper-body motion: facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of g…
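A minimal sketch of the weak-supervision idea in the abstract: audio-derived VAD decisions stand in for ground-truth "is this person speaking" labels when training the visual classifier. All names, feature dimensions, and the synthetic data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: one feature vector per video segment of a tracked
# person, and a (noisy) audio VAD label for the same segment.
n_segments, feat_dim = 200, 64
video_features = rng.normal(size=(n_segments, feat_dim))
vad_labels = rng.integers(0, 2, size=n_segments)  # 1 = voice activity detected

# Weak supervision: fit the vision-based classifier directly on the
# audio-derived labels instead of manual annotations.
clf = LogisticRegression(max_iter=1000).fit(video_features, vad_labels)
scores = clf.predict_proba(video_features)[:, 1]  # per-segment speaking score
```

Any classifier could play this role; logistic regression is used here only to keep the sketch short.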

Cited by 48 publications (74 citation statements). References 35 publications.
“…In this paper, we use the above video-based person-specific active speaker detection models to train personalized audio voice models. This further improves the performance of the detection of active speakers in the dataset used by [3], to almost 100%.…”
Section: Introduction
confidence: 90%
“…We use Improved Trajectory (IT) features, spatio-temporal features originally used for action recognition [20], and adapted by [2,3] for active speaker detection. These features are a concatenation of Histogram of Oriented Gradients (HoG), Histogram of Optical Flow (HoF) and Motion Boundary Histogram (MBH) features calculated around feature points tracked over a sequence of 15 frames.…”
Section: Video-based Active Speaker Detection
confidence: 99%
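The excerpt above describes Improved Trajectory features as a concatenation of HoG, HoF, and MBH histograms computed around feature points tracked over 15 frames. A sketch of that concatenation step, with illustrative descriptor sizes (the actual Improved Trajectory implementation fixes its own dimensions):

```python
import numpy as np

# Assumed per-trajectory histogram sizes, for illustration only.
HOG_DIM, HOF_DIM, MBH_DIM = 96, 108, 192
TRACK_LEN = 15  # frames each feature point is tracked, as in the excerpt

def trajectory_descriptor(hog, hof, mbh):
    """Concatenate the per-trajectory histograms into one feature vector."""
    assert hog.shape == (HOG_DIM,)
    assert hof.shape == (HOF_DIM,)
    assert mbh.shape == (MBH_DIM,)
    return np.concatenate([hog, hof, mbh])

rng = np.random.default_rng(0)
desc = trajectory_descriptor(
    rng.random(HOG_DIM), rng.random(HOF_DIM), rng.random(MBH_DIM)
)
# desc has HOG_DIM + HOF_DIM + MBH_DIM = 396 entries
```

One such vector per tracked trajectory would then feed the active speaker classifier.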