AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Roth, Joseph; Chaudhuri, Sourish; Klejch, Ondřej; Marvin, Radhika; Gallagher, Andrew; Kaver, Liat; Ramaswamy, Sharadh; Stopczynski, Arkadiusz; Schmid, Cordelia; Xi, Zhonghua; Pantofaru, Caroline

doi:10.48550/arxiv.1901.01342

Cited by 12 publications

(35 citation statements)

References 0 publications

“…In the video domain, a common multi-modal paradigm involves combining representations from both visual and audio features [4,7,21,32,33,36,48]. Such representations have attracted the interest of the computer vision community, as they allow exploring new approaches to well established problems, such as person reidentification [32,24,54], audio-visual synchronization [1,8,9], speaker diarization [43,47,58], bio-metrics [33,39], and audio-visual source separation [4,21,36,40,48]. Active speaker detection is a special instance of audiovisual source separation, where sources are the visible persons in a video, and the goal is to detect and assign a segment of speech to one of those candidates.…”

Section: Related Work (mentioning)

confidence: 99%

“…Active speaker detection is a special instance of audiovisual source separation, where sources are the visible persons in a video, and the goal is to detect and assign a segment of speech to one of those candidates. The selected candidate is known as the active speaker [40].…”

Section: Related Work (mentioning)

confidence: 99%

“…To address the lack of a large scale testbed, the work of Roth et al [40], introduced the AVA-ActiveSpeaker dataset and benchmark, the first large-scale video dataset for the active speaker detection task. Upon the release of this dataset and its baseline, some novel approaches have been published.…”

Section: Related Work (mentioning)

confidence: 99%

“…Despite its multiple applications such as speaker diarization [3,43,45,47], human-computer interaction [15,57] and bio-metrics [33,39], the detection of active speakers in-the-wild remains an open problem. Recently, the AVA Active-Speaker dataset and benchmark [40] has provided the first large-scale standard benchmark for evaluating this problem, thereby enabling it to be approached with modern machine learning techniques.…”

Section: Introduction (mentioning)

confidence: 99%

“…Current approaches for active speaker detection are based on recurrent neural networks [2,40,42], or 3D convolutional models [1,6,59]. Their main focus is to jointly model audio and visual streams to maximize the single speaker prediction confidence over short sequences.…”

Section: Introduction (mentioning)

confidence: 99%

MAAS: Multi-modal Assignation for Active Speaker Detection

León-Alcázar, Heilbron, Thabet et al. 2021

Preprint

Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherent multimodal nature, current methods still focus on modeling and fusing short-term audiovisual features for individual speakers, often at frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem, and provides a straightforward strategy where independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from a single frame makes it possible to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Grauman, Westbury, Byrne et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.

Self-Supervised Learning for Audio-Visual Speaker Diarization

Ding, Zhang et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Speaker diarization, which is to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferences or human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method to address the problem of speaker diarization without massive labeling effort. We improve the previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system and the results show our best model yields a remarkable gain of +8% F1-scores as well as diarization error rate reduction. Finally, we introduce a new large scale audio-video corpus designed to fill the vacancy of audio-video dataset in Chinese.
