Audio-Visual Model Distillation Using Acoustic Images

Pérez, Andrés F.; Sanguineti, Valentina; Morerio, Pietro; Murino, Vittorio

doi:10.48550/arxiv.1904.07933

Cited by 1 publication

(1 citation statement)

References 0 publications

“…These may imply that, by the definition of learnability [12], the task is not a fully learnable problem only with unsupervised data in our setting, which is static-image based single-channel audio source localization, but can be fixed with even a small amount of relevant prior knowledge. Although the sound localization task is not effectively addressed with our unsupervised learning approach with static images and mono audios, other methods that use spatial microphones [25], [53], [54], [55] or temporal information, motion [8] and synchronization [18], with multiple frames have been shown to perform well on this task with unsupervised algorithms.…”

Section: Discussion (mentioning)

confidence: 99%

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Senocak, Kim, et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed without human prior knowledge due to the well-known correlation and causality mismatch misconception. To fix this issue, we extend our network to the supervised and semi-supervised network settings via a simple modification due to the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings on the cross-modal content alignment and we extend this proposed algorithm to a new application, sound saliency based automatic camera view panning in 360° videos.
