2018
DOI: 10.1007/978-3-030-01246-5_35

The Sound of Pixels

Abstract: We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mi…
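The self-supervised idea the abstract describes — using the natural synchronization of audio and video in place of manual labels — is commonly realized by mixing the audio of two solo videos and training a network to undo the mixture. The NumPy sketch below illustrates that "mix-and-separate" setup; the array shapes, the use of ideal binary masks as targets, and all variable names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "source" magnitude spectrograms (freq x time), standing in for the
# audio tracks of two solo videos. In a real pipeline these come from an STFT.
s1 = rng.random((256, 64))
s2 = rng.random((256, 64))

# Mix-and-separate: summing the sources gives a synthetic mixture whose
# correct separation is known by construction -- free supervision.
mix = s1 + s2

# Ideal binary masks serve as self-supervised targets: a network conditioned
# on visual features would be trained to predict these from the mixture.
mask1 = (s1 >= s2).astype(np.float32)
mask2 = 1.0 - mask1

# Applying each mask to the mixture yields the separated estimates; the two
# estimates sum back to the mixture exactly.
est1 = mask1 * mix
est2 = mask2 * mix

print(mix.shape, np.allclose(est1 + est2, mix))
```

Because the masks partition every time-frequency bin between the two sources, the estimates always reconstruct the mixture; the learning problem is predicting the masks from audio plus pixel-level visual features alone.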

Cited by 395 publications (588 citation statements)
References 45 publications
“…Existing works [10,28] on visual sound separation mainly separate each sound independently. They assume either fixed types or fixed numbers of sounds, separating sounds independently.…”
Section: Introduction
Confidence: 99%
“…With new advances in deep learning, this task has attracted more attention [2], [3], [5], [8], [9], [14]. Since the approach we propose recently [2], some interesting methods on the sound source localization task have been developed.…”
Section: Related Work and Problem Context
Confidence: 99%
“…Furthermore, our networks have an attention layer that interacts between the two modalities and reveals the localization information of the sound source. In [5], Zhao et al also explore the sound source localization in the musical instruments domain. On the other hand, several methods [8], [18] are designed to localize actions in videos, rather than objects in static images with an unsupervised learning method.…”
Section: Related Work and Problem Context
Confidence: 99%
“…[18,26] also explored how to generate spatial sound for videos. More recently, [40,39,13,17,32] used the visual-audio correspondence to separate sound sources. In contrast to previous work that has only transferred class-level information between modalities, this work transfers richer, region-level location information about objects.…”
Section: Cross-modal Self-supervised Learning
Confidence: 99%