ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413776

Multi-Target DoA Estimation with an Audio-Visual Fusion Mechanism

Abstract: Most of the prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio and visual signals for multi-speaker localization. The use of heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these is…
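The abstract outlines the general recipe: extract features from each modality, fuse them, and predict speaker directions. The sketch below is a minimal PyTorch illustration of that idea; the branch architectures, feature dimensions, and the 360-bin azimuth grid are assumptions chosen for the example, not details taken from the paper.

```python
# Illustrative sketch only: a generic late-fusion audio-visual DoA network.
# Feature dimensions, branch designs, and the 360-bin azimuth grid are
# assumptions for demonstration, NOT details from the paper.
import torch
import torch.nn as nn

class AudioVisualDoANet(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, n_azimuth_bins=360):
        super().__init__()
        # Audio branch: maps per-frame spatial features (e.g., GCC-PHAT
        # or inter-channel phase patterns) to an embedding.
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Visual branch: maps per-frame visual features (e.g., pooled
        # face-detection embeddings) to an embedding of the same size.
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Fusion head: concatenates both embeddings and predicts an
        # independent presence probability per azimuth bin, so several
        # bins can be active at once (the multi-target case).
        self.head = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_azimuth_bins),
        )

    def forward(self, audio_feats, visual_feats):
        a = self.audio_branch(audio_feats)
        v = self.visual_branch(visual_feats)
        logits = self.head(torch.cat([a, v], dim=-1))
        # Sigmoid (not softmax) so multiple simultaneous speakers can
        # each produce a high posterior at their own direction.
        return torch.sigmoid(logits)

# Example: a batch of 8 frames with pre-extracted features.
net = AudioVisualDoANet()
posteriors = net(torch.randn(8, 512), torch.randn(8, 512))
print(posteriors.shape)  # torch.Size([8, 360])
```

A per-bin sigmoid output is one common way to let the network report several simultaneous directions; a softmax would instead force a single dominant DoA per frame.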

Cited by 31 publications (27 citation statements) | References 21 publications
“…7) most of the time, although the improvements are not as high as under conditions with missing visual frames (Sys. 1–6).…”
Section: Ablation Studies
confidence: 99%
“…Speech is the most natural way of communication between humans. Therefore, the study and development of human-machine interaction systems, such as active speaker detection [1], speaker localization [2], speech recognition [3], and emotion recognition [4], constitute an important part of today's research. However, these algorithms are adversely affected by the presence of interfering speakers and acoustic noise.…”
Section: Introduction
confidence: 99%
“…However, the performance of computer processing of speech, such as automatic speech recognition [1], speaker localization [2], active speaker detection [3], and speech emotion recognition [4], degrades dramatically in the presence of interfering speakers. This prompts us to study ways to extract speech similar to how humans perceive it.…”
Section: Introduction
confidence: 99%
“…The acoustic environment during real-world human-robot interaction can be described as a cocktail party [1], where the speech from a speaker of interest, i.e., the target speaker, is often corrupted by interfering speakers and background noise. In such a scenario, speech separation or speaker extraction algorithms are usually needed to extract the clean speech signal of the target speaker [2,3], which is a crucial step for downstream applications such as hearing aid development [4], automatic speech recognition [5], and source localization [6].…”
Section: Introduction
confidence: 99%