2016
DOI: 10.1007/978-3-319-46454-1_18
Cross-Modal Supervision for Learning Active Speaker Detection in Video

Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper-body motion: facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of g…
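A minimal sketch of the weak-supervision idea in the abstract: audio-derived VAD decisions stand in for ground-truth "is this person speaking" labels when training the visual classifier. All names, feature dimensions, and the synthetic data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: one feature vector per video segment of a tracked
# person, and a (noisy) audio VAD label for the same segment.
n_segments, feat_dim = 200, 64
video_features = rng.normal(size=(n_segments, feat_dim))
vad_labels = rng.integers(0, 2, size=n_segments)  # 1 = voice activity detected

# Weak supervision: fit the vision-based classifier directly on the
# audio-derived labels instead of manual annotations.
clf = LogisticRegression(max_iter=1000).fit(video_features, vad_labels)
scores = clf.predict_proba(video_features)[:, 1]  # per-segment speaking score
```

Any classifier could play this role; logistic regression is used here only to keep the sketch short.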

Cited by 48 publications (74 citation statements). References 35 publications.
“…In this paper, we use the above video-based person-specific active speaker detection models to train personalized audio voice models. This further improves the performance of the detection of active speakers in the dataset used by [3], to almost 100%.…”
Section: Introduction
confidence: 90%
“…We use Improved Trajectory (IT) features, spatio-temporal features originally used for action recognition [20], and adapted by [2,3] for active speaker detection. These features are a concatenation of Histogram of Oriented Gradients (HoG), Histogram of Optical Flow (HoF) and Motion Boundary Histogram (MBH) features calculated around feature points tracked over a sequence of 15 frames.…”
Section: Video-based Active Speaker Detection
confidence: 99%
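The excerpt above describes Improved Trajectory features as a concatenation of HoG, HoF, and MBH histograms computed around feature points tracked over 15 frames. A sketch of that concatenation step, with illustrative descriptor sizes (the actual Improved Trajectory implementation fixes its own dimensions):

```python
import numpy as np

# Assumed per-trajectory histogram sizes, for illustration only.
HOG_DIM, HOF_DIM, MBH_DIM = 96, 108, 192
TRACK_LEN = 15  # frames each feature point is tracked, as in the excerpt

def trajectory_descriptor(hog, hof, mbh):
    """Concatenate the per-trajectory histograms into one feature vector."""
    assert hog.shape == (HOG_DIM,)
    assert hof.shape == (HOF_DIM,)
    assert mbh.shape == (MBH_DIM,)
    return np.concatenate([hog, hof, mbh])

rng = np.random.default_rng(0)
desc = trajectory_descriptor(
    rng.random(HOG_DIM), rng.random(HOF_DIM), rng.random(MBH_DIM)
)
# desc has HOG_DIM + HOF_DIM + MBH_DIM = 396 entries
```

One such vector per tracked trajectory would then feed the active speaker classifier.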