2018
DOI: 10.1007/978-3-030-01246-5_35

The Sound of Pixels

Abstract: We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mi…
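The self-supervised idea the abstract describes — using the natural synchronization of audio and video in place of manual labels — is commonly realized by mixing the audio of two solo videos and training a network to undo the mixture. The NumPy sketch below illustrates that "mix-and-separate" setup; the array shapes, the use of ideal binary masks as targets, and all variable names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "source" magnitude spectrograms (freq x time), standing in for the
# audio tracks of two solo videos. In a real pipeline these come from an STFT.
s1 = rng.random((256, 64))
s2 = rng.random((256, 64))

# Mix-and-separate: summing the sources gives a synthetic mixture whose
# correct separation is known by construction -- free supervision.
mix = s1 + s2

# Ideal binary masks serve as self-supervised targets: a network conditioned
# on visual features would be trained to predict these from the mixture.
mask1 = (s1 >= s2).astype(np.float32)
mask2 = 1.0 - mask1

# Applying each mask to the mixture yields the separated estimates; the two
# estimates sum back to the mixture exactly.
est1 = mask1 * mix
est2 = mask2 * mix

print(mix.shape, np.allclose(est1 + est2, mix))
```

Because the masks partition every time-frequency bin between the two sources, the estimates always reconstruct the mixture; the learning problem is predicting the masks from audio plus pixel-level visual features alone.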

Cited by 395 publications (588 citation statements)
References 45 publications
“…Existing works [10,28] on visual sound separation mainly separate each sound independently. They assume either fixed types or fixed numbers of sounds, separating sounds independently.…”
Section: Introduction
Confidence: 99%
“…With new advances in deep learning, this task has attracted more attention [2], [3], [5], [8], [9], [14]. Since the approach we propose recently [2], some interesting methods on the sound source localization task have been developed.…”
Section: Related Work and Problem Context
Confidence: 99%
“…Furthermore, our networks have an attention layer that interacts between the two modalities and reveals the localization information of the sound source. In [5], Zhao et al also explore the sound source localization in the musical instruments domain. On the other hand, several methods [8], [18] are designed to localize actions in videos, rather than objects in static images with an unsupervised learning method.…”
Section: Related Work and Problem Context
Confidence: 99%
“…[18,26] also explored how to generate spatial sound for videos. More recently, [40,39,13,17,32] used the visual-audio correspondence to separate sound sources. In contrast to previous work that has only transferred class-level information between modalities, this work transfers richer, region-level location information about objects.…”
Section: Cross-modal Self-supervised Learning
Confidence: 99%