ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053825
Static Visual Spatial Priors for DoA Estimation

Abstract: As we interact with the world, for example when we communicate with our colleagues in a large open space or meeting room, we continuously analyse the surrounding environment and, in particular, localise and recognise acoustic events. While we largely take such abilities for granted, they represent a challenging problem for current robots or smart voice assistants as they can be easily fooled by high degree of sound interference in acoustically complex environments. Preventing such failures when using solely au…

Cited by 1 publication (2 citation statements)
References 43 publications (49 reference statements)
“…5 B). When the presence of a user is detected and the device has some update ready, or a voice trigger is spotted, the device wakes up and faces the user [47], starting at the same time to process the audio-visual data. This, depending on the compute requirements, can happen either on device or in the cloud (Fig.…”
Section: Software
confidence: 99%
“…Additionally, we estimate a direction of arrival (DOA) θ_s for each of the detected sounds 's' using a set of DOA estimates from the raw signal (as many as detected acoustic events at each given time step), which are then mapped to x, y, z coordinates 3 . This process can leverage additional semantic information from the vision stream, as shown in [47]. The most likely pairs {acoustic_event, θ_s} for co-occurring events are estimated in the spatial model using visual data.…”
Section: Semantic Scene Understanding
confidence: 99%
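The mapping from a DOA angle θ_s to x, y, z coordinates mentioned in the quote above can be sketched as follows. This is a minimal illustration only: it assumes a spherical azimuth/elevation parameterization on a unit sphere, since the citing work's exact convention is not given here, and the function name `doa_to_cartesian` is hypothetical.

```python
import math

def doa_to_cartesian(azimuth_rad, elevation_rad, r=1.0):
    """Map a DOA estimate (azimuth, elevation, in radians) to x, y, z
    coordinates on a sphere of radius r around the microphone array.

    Assumed convention: azimuth measured in the x-y plane from the x-axis,
    elevation measured up from that plane.
    """
    x = r * math.cos(elevation_rad) * math.cos(azimuth_rad)
    y = r * math.cos(elevation_rad) * math.sin(azimuth_rad)
    z = r * math.sin(elevation_rad)
    return x, y, z

# A source directly ahead (azimuth 0, elevation 0) lands on the x-axis.
print(doa_to_cartesian(0.0, 0.0))
```

With per-event DOA estimates converted this way, each {acoustic_event, θ_s} pair can be compared against visually detected object positions in the same Cartesian frame.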