Proceedings of the 2015 ACM on International Conference on Multimodal Interaction 2015
DOI: 10.1145/2818346.2820780
Who's Speaking?

Abstract: Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates that active speakers can also be identified from spatio-temporal video features that capture other cues: movement of the head, upper body and hands. Speaker directional information, obtained via sound source localization from a microphone array, is used to supervise the training of these video features.
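The supervision signal described in the abstract comes from the audio side: a microphone array localizes the active speaker, and that direction serves as a weak label for training the visual features. A minimal sketch of one common way to obtain such a directional signal, assuming a two-microphone array and GCC-PHAT delay estimation (the function names and the left/right boundary below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref`
    using the GCC-PHAT cross-correlation."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre the circular correlation around zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def weak_speaker_label(tau, boundary_tau=0.0):
    """Crude directional label from the inter-microphone delay.
    The left/right mapping is an assumption for illustration."""
    return "left" if tau > boundary_tau else "right"

# Synthetic check: a source whose signal reaches the second
# microphone 25 samples later at a 16 kHz sampling rate.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
d = 25
sig = np.concatenate((np.zeros(d), ref[:-d]))
tau = gcc_phat(sig, ref, fs, max_tau=0.005)
print(tau)                  # ~ d / fs = 0.0015625 s
print(weak_speaker_label(tau))
```

In the paper's setting, per-frame direction estimates of this kind would be aligned with people detected in the video to generate positive and negative training labels for the visual classifier.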

Cited by 28 publications (6 citation statements)
References 12 publications
“…They proposed a time-delayed neural network to learn the audiovisual correlations from speech activity. P. Chakravarty [30] re-explored leveraging audio as supervision using rich alignment between audio and visual information. Following this, [13], [17], [32] proposed a model that jointly trains an audiovisual embedding that enables more accurate active speaker detection.…”
Section: Related Work
confidence: 99%
“…Active Speaker Detection. Works on ASD have evolved from facial visual cues [21,34,42] to audio as primary source [6,18], to multi-modal data combination [3,4,29,40,46]. Since the introduction of AVA-ActiveSpeaker [40], combining audio with facial features is the de facto way to predict active speakers.…”
Section: Related Work
confidence: 99%
“…Other methods approach the task of ASL, which seeks to localize speakers spatially within the scene rather than classifying bounding box tracks [7,16,24,37,38,87,104,106]. Several use multichannel audio to incorporate directional audio information [7,16,24,37,38,104,106]. Recently,…”
Section: Related Work
confidence: 99%