2021
DOI: 10.1109/tpami.2019.2952095
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Abstract: Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate a visual scene with its sound, and localize the sound source, merely by observing them as humans do? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. To achieve this goal, a two-stream network structure that handles each modality with an attention mechanism is developed …
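To make the abstract's architecture concrete, below is a minimal PyTorch sketch of a two-stream network with audio-guided attention over the visual feature map. The backbone stand-ins, dimensions, and module names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    """Two-stream audio-visual network: attention over the visual feature
    map, conditioned on the audio embedding (sketch of the idea in the
    abstract; backbones here are toy stand-ins for CNN encoders)."""
    def __init__(self, dim=512):
        super().__init__()
        self.visual_net = nn.Conv2d(3, dim, kernel_size=16, stride=16)    # -> B x dim x H x W
        self.audio_net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))  # -> B x dim

    def forward(self, image, spectrogram):
        v = self.visual_net(image)                        # B x D x H x W
        a = self.audio_net(spectrogram)                   # B x D
        v_flat = v.flatten(2)                             # B x D x HW
        # Cosine-similarity attention between the audio vector and each
        # spatial location; high values indicate likely sound sources.
        att = torch.einsum('bd,bdn->bn',
                           F.normalize(a, dim=1),
                           F.normalize(v_flat, dim=1))    # B x HW
        att = att.softmax(dim=1)
        # Attention-pooled visual vector, usable in a correspondence loss.
        z = torch.einsum('bn,bdn->bd', att, v_flat)       # B x D
        H, W = v.shape[2], v.shape[3]
        return att.view(-1, H, W), z, a                   # localization map + embeddings
```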

Cited by 42 publications (37 citation statements)
References 45 publications
“…With the rise of deep learning techniques, the field of audio-visual learning has received a significant boost, especially for problems formulated in unsupervised and self-supervised manners. Along this line of research, several works have focused on representation learning, with further applications in audio classification, action recognition, and source localisation [18], [37], [38], [39], [40], [41], [42], [43]. Most of them combined features from two-stream networks (one sub-network for the audio and another for the visual modality), either by concatenating them or through an additional attention module.…”
Section: Audio-Visual Deep Learning Methods (mentioning)
confidence: 99%
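As a rough illustration of the two fusion strategies this excerpt contrasts, here is a minimal PyTorch sketch; layer sizes and class names are assumptions for illustration, not drawn from any of the cited works.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Late fusion by concatenating audio and visual embeddings."""
    def __init__(self, dim=512, n_classes=10):
        super().__init__()
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, a, v):                 # a, v: B x dim
        return self.head(torch.cat([a, v], dim=1))

class GatedFusion(nn.Module):
    """Attention-style fusion: the audio embedding gates the visual one."""
    def __init__(self, dim=512, n_classes=10):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.head = nn.Linear(dim, n_classes)

    def forward(self, a, v):
        return self.head(self.gate(a) * v)   # elementwise modulation of v by a
```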
“…Most of them combined features from two-stream networks (one sub-network for the audio and another for the visual modality), either by concatenating them or through an additional attention module. Some employed time synchrony between samples of the same video [18], [44], while others learnt to extract features by identifying whether an audio sample corresponded to given visual data [18], [37], [40]. More recent work has also used audio to distil redundant visual information and reduce computational costs [41].…”
Section: Audio-Visual Deep Learning Methods (mentioning)
confidence: 99%
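The correspondence objective mentioned in this excerpt (deciding whether an audio sample matches given visual data) can be sketched as a binary classification loss. The in-batch negative-sampling trick below is a common convention, not necessarily the exact scheme of the cited works.

```python
import torch
import torch.nn.functional as F

def avc_loss(a_emb, v_emb):
    """Audio-visual correspondence loss: classify whether an audio
    embedding comes from the same video as a visual embedding.
    a_emb, v_emb: B x D tensors of matched pairs."""
    pos = (a_emb * v_emb).sum(dim=1)                  # scores for matched pairs
    # Mismatched pairs made by shifting the audio batch by one position.
    neg = (a_emb.roll(1, dims=0) * v_emb).sum(dim=1)
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```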
“…It is also possible to reduce the complexity of the action recognition process by exploiting additional side information. For example, sound information or pre-calculated features, which are normally present in compressed video data, can be utilized to enable clip-level processing [37] or to select the dominant clips strongly related to the actions [38], [39], increasing recognition accuracy while also reducing computational complexity.…”
Section: B. Related Work (mentioning)
confidence: 99%
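A toy sketch of the clip-selection idea from this excerpt: score clips with cheap audio features and run the expensive visual network only on the dominant ones. The L2-norm saliency criterion here is a placeholder assumption; the cited works use learned selectors.

```python
import torch

def select_clips_by_audio(audio_feats, k=3):
    """Pick the k clips whose audio features have the highest energy,
    so the heavy visual model processes only a subset of clips.
    audio_feats: T x D tensor, one cheap audio feature per clip."""
    scores = audio_feats.norm(dim=1)       # toy saliency score per clip
    topk = scores.topk(k).indices          # indices of dominant clips
    return topk.sort().values              # keep temporal order

# Usage: run the heavy 3D CNN only on video_clips[selected]
audio_feats = torch.randn(16, 128)         # 16 clips, toy features
selected = select_clips_by_audio(audio_feats, k=3)
```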
“…Audio cues are also utilized to improve video recognition performance [15,45,47]. These multimodal inputs are used in other video tasks as well, such as self-supervised learning in videos [1,25], sound localization [32,36], and sound generation [54,55] from videos. Existing fusion approaches typically use fixed fusion weights for all samples.…”
Section: Related Work (mentioning)
confidence: 99%
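In contrast to the fixed fusion weights this excerpt criticizes, a per-sample alternative can be sketched as follows; the gating design and dimensions are assumptions for illustration, not the citing paper's method.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Per-sample fusion: a small gating net predicts modality weights
    from the concatenated embeddings, instead of one fixed weight."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=1))

    def forward(self, a, v):                     # a, v: B x dim
        w = self.gate(torch.cat([a, v], dim=1))  # B x 2, sums to 1 per sample
        return w[:, :1] * a + w[:, 1:] * v       # sample-specific weighted sum
```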