Look Who’s Talking: Active Speaker Detection in the Wild

Kim, You Jin; Heo, Hee-Soo; Choe, Soyeon; Chung, Soo-Whan; Kwon, Yoohwan; Lee, Bong-Jin; Kwon, Youngki; Chung, Joon Son

doi:10.21437/interspeech.2021-2041

Cited by 8 publications

(2 citation statements)

References 0 publications


“…Several AVSR architectures have been proposed [4,10,17,13,16,22,23] which show that the improvement over ASR models is greater as the noise level increases, i.e., the SNR is lower. The same VSR architectures can also be used to improve the performance of audio-based models in a variety of applications like speech enhancement [24], speech separation [25,26], voice activity detection [27], active speaker detection [28] and speaker diarisation [29].…”

Section: Applications (mentioning)

confidence: 99%

Visual Speech Recognition for Multiple Languages in the Wild

Ma, Petridis, Pantić

2022

Preprint


Visual speech recognition (VSR) aims to recognise the content of speech based on the lip movements without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than the model design. In this work, we demonstrate that designing better models is as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. Furthermore, we show that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.



“…The goal of noise-tolerant speaker diarization is to achieve improved performance in noisy environments. A recent work [19] tackles this problem using the auto-encoder architecture as a dimensionality reduction module. They extract two low-dimensional codes from speaker embeddings, representing the speaker identity and irrelevant noise information, then remove the noise factors.…”

Section: Introduction and Related Work (mentioning)

confidence: 99%

Making Speaker Diarization System Noise Tolerant

Karamyan, Kirakosyan, Harutyunyan

2023

MPCS


The goal of speaker diarization is to identify and separate different speakers in a multi-speaker audio recording. However, noise in the recording can reduce the accuracy of these systems. In this paper, we explore methods such as multi-condition training, consistency regularization, and teacher-student techniques to improve the resilience of speaker embedding extractors to noise. We test the effectiveness of these methods on speaker verification and speaker diarization tasks and demonstrate that they lead to improved performance in the presence of noise and reverberation. To test the speaker verification and diarization systems under noisy and reverberant conditions, we created augmented versions of the VoxCeleb1 cleaned test and VoxConverse dev datasets by adding noise and echo with different SNR values. Our results show that, on average, we can achieve a 19.1% relative improvement in speaker recognition using the teacher-student method and a 17% relative improvement in speaker diarization using consistency regularization compared to a multi-condition trained baseline.


AS-Net: active speaker detection using deep audio-visual attention

Radman, Laaksonen

2024

Multimed Tools Appl


Active Speaker Detection (ASD) aims at identifying the active speaker among multiple speakers in a video scene. Previous ASD models often extract audio and visual features from long video clips with a complex 3D Convolutional Neural Network (CNN) architecture. Although models based on 3D CNNs can generate discriminative spatial-temporal features, this comes at the expense of computational complexity, and they frequently face challenges in detecting active speakers in short video clips. This work proposes the Active Speaker Network (AS-Net) model, a simple yet effective ASD method tailored for detecting active speakers in relatively short video clips without relying on 3D CNNs. Instead, it incorporates the Temporal Shift Module (TSM) into 2D CNNs, facilitating the extraction of dense temporal visual features without the need for additional computations. Moreover, self-attention and cross-attention schemes are introduced to enhance long-term temporal audio-visual synchronization, thereby improving ASD performance. Experimental results demonstrate that AS-Net outperforms state-of-the-art 2D CNN-based methods on the AVA-ActiveSpeaker dataset and remains competitive with methods that use more complex architectures.

