2022
DOI: 10.1109/tmm.2021.3061800

Audio-Visual Tracking of Concurrent Speakers

Abstract: Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple concurrent speakers with a de-emphasized acoustic map assisted by the image detection-derived 3D video observations. The 3D multimodal observations are either ass…
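
The abstract describes generative and discriminative audio-visual likelihoods fused in a particle filtering framework. The following is a minimal sketch of a bootstrap particle filter that fuses an acoustic-map score with a 3D video detection for a single speaker; the function names, the random-walk motion model, and both placeholder likelihoods are illustrative assumptions, not the paper's models.

```python
import numpy as np

# Minimal bootstrap particle filter for one speaker's 3D position.
# The motion model and both likelihoods are illustrative placeholders,
# not the generative/discriminative models from the paper.

def propagate(particles, noise_std=0.05):
    """Random-walk motion model: diffuse (N, 3) particles in meters."""
    return particles + np.random.normal(0.0, noise_std, particles.shape)

def audio_likelihood(particles, acoustic_map):
    """Score particles against an acoustic map; `acoustic_map` is assumed
    to be a callable mapping (N, 3) positions to (N,) non-negative scores."""
    return np.clip(acoustic_map(particles), 1e-12, None)

def video_likelihood(particles, detection_3d, sigma=0.3):
    """Gaussian likelihood around a 3D observation derived from an image
    detection (e.g., a face box back-projected to 3D)."""
    d2 = np.sum((particles - detection_3d) ** 2, axis=1)
    return np.exp(-0.5 * d2 / sigma**2)

def step(particles, weights, acoustic_map, detection_3d):
    """One predict/update/resample cycle fusing the two modalities."""
    particles = propagate(particles)
    weights = weights * audio_likelihood(particles, acoustic_map) \
                      * video_likelihood(particles, detection_3d)
    weights = weights / weights.sum()
    # Multinomial resampling when the effective sample size collapses.
    if 1.0 / np.sum(weights**2) < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```

A full tracker along the abstract's lines would additionally maintain one filter per speaker, handle track births and deaths for the unknown speaker count, and substitute the paper's generative/discriminative likelihoods for the placeholders above.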

Cited by 16 publications (15 citation statements, all mentioning)
References 58 publications
“…With the development of deep learning techniques, increasing research attention has been paid to combining/fusing audio and visual modalities for vision-related tasks, e.g., sounding object localization [18], audio-visual synchronization [73], object tracking [74], and saliency detection [75]. Although the primary focus of this review is saliency detection, we also review several of the most representative audio-visual tasks [76], [77], [78], since these fusion-related works can be directly consulted and give deeper insight into audio-visual saliency detection. For readability, we introduce the three most representative tasks here: audio-visual correspondence (AVC), face and audio matching (FAM), and sound-object localization (SOL).…”
Section: Audio-visual Multi-modality Fusion (mentioning)
confidence: 99%
“…To evaluate the TLR in 3D, a target is considered lost if the error with respect to the ground truth is larger than 300 mm. We also use a fine 3D error metric, in which only the frames where tracking is successful are considered in Equation (27).…”
Section: Evaluation Metrics (mentioning)
confidence: 99%
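
As a hedged illustration of the metrics quoted above, here is a minimal sketch computing the track-loss rate (TLR) with the 300 mm threshold and the fine error averaged over the successfully tracked frames only; the array layout and function name are assumptions, and the citing paper's Equation (27) is not reproduced here.

```python
import numpy as np

def tracking_metrics(est, gt, thresh_mm=300.0):
    """Track-loss rate and fine 3D error, as described in the excerpt.

    est, gt: (T, 3) arrays of estimated and ground-truth positions in mm.
    A frame is counted as lost when its 3D error exceeds `thresh_mm`;
    the fine error averages the error over the successful frames only."""
    err = np.linalg.norm(est - gt, axis=1)   # per-frame 3D error (mm)
    lost = err > thresh_mm
    tlr = lost.mean()                        # fraction of frames lost
    fine_error = err[~lost].mean() if (~lost).any() else float("nan")
    return tlr, fine_error
```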
“…Active speaker detection is a multi-modal task aiming to identify active speakers from a set of candidates in an arbitrary video. It plays an essential role in speaker diarization [7,41], speaker tracking [27,28], automatic video editing [10,19], and other applications, and has attracted extensive attention from both industry and academia.…”
Section: Introduction (mentioning)
confidence: 99%