2018
DOI: 10.1007/978-3-030-01219-9_3

Learning to Separate Object Sounds by Watching Unlabeled Video

Abstract: Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing them in isolation.
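
As a rough illustration of the frequency-basis idea in the abstract (a minimal sketch, not the authors' released implementation), the snippet below factorizes a mixture spectrogram with NMF and reconstructs the audio attributed to a subset of bases. The basis-to-object assignment that the paper learns with multi-instance multi-label learning is stubbed out as the hypothetical `object_basis_idx`, and `mixture.wav` is a placeholder input.

```python
# Minimal NMF-based separation sketch; `object_basis_idx` and "mixture.wav"
# are hypothetical stand-ins, not artifacts from the paper.
import numpy as np
import librosa
from sklearn.decomposition import NMF

audio, sr = librosa.load("mixture.wav", sr=16000)  # placeholder input clip
mixture_stft = librosa.stft(audio, n_fft=1024)
spec = np.abs(mixture_stft)                        # magnitude spectrogram, F x T

# Factorize spec ~= W @ H: W holds frequency bases, H their activations.
nmf = NMF(n_components=25, init="random", max_iter=400, random_state=0)
W = nmf.fit_transform(spec)   # F x K frequency bases
H = nmf.components_           # K x T activations

# Pretend bases 0, 3, 7 were assigned to one visual object by the learned model.
object_basis_idx = [0, 3, 7]
obj_mag = W[:, object_basis_idx] @ H[object_basis_idx, :]

# Soft-mask the mixture and invert using the mixture phase.
mask = obj_mag / (W @ H + 1e-8)
obj_audio = librosa.istft(mask * mixture_stft)
```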

Cited by 249 publications (221 citation statements)
References 84 publications

“…The most related works are [28] and [10]. In [10], a convolutional network predicts the types of objects appearing in the video, and Non-negative Matrix Factorization [9] extracts a set of basic components. The association between each object and each basic component is then estimated via a Multi-Instance Multi-Label objective.…”
Section: Related Work (mentioning)
confidence: 99%
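
To make the object-to-basis association described above concrete, here is a minimal MIML-style head in PyTorch, assuming each video comes with K basis feature vectors (the instances) and weak video-level object labels (the bag labels). The class name `MIMLHead` and all dimensions are hypothetical, not the cited paper's architecture; after training, the per-instance scores before the max pooling indicate which bases associate with which object classes.

```python
# Hedged MIML-style sketch: associate NMF bases (instances) with object
# labels (bag-level, weak supervision). Names and sizes are hypothetical.
import torch
import torch.nn as nn

class MIMLHead(nn.Module):
    """Scores each audio basis against each object class, then max-pools
    over the bag so a video-level multi-label loss can be applied."""
    def __init__(self, basis_dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(basis_dim, num_classes)

    def forward(self, bases: torch.Tensor) -> torch.Tensor:
        # bases: (batch, K, basis_dim) -- K basis features per video
        inst_scores = self.scorer(bases)       # (batch, K, num_classes)
        return inst_scores.max(dim=1).values   # video-level class scores

# Toy usage: 8 videos, 25 bases each, 64-d basis features, 15 classes.
model = MIMLHead(basis_dim=64, num_classes=15)
bases = torch.randn(8, 25, 64)
labels = torch.randint(0, 2, (8, 15)).float()  # weak video-level labels
loss = nn.BCEWithLogitsLoss()(model(bases), labels)
loss.backward()
```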
“…This results in a feature tensor of size T × (H/16) × (W/16) × k. In both training and testing, this feature tensor is reduced to a vector representing the visual content by max pooling along the first three dimensions. On top of this solo-video collection, we then follow the Mix-and-Separate strategy of [28, 10] to construct the mixed video/sound data, where each sample mixes n videos and is called a mix-n sample.…”
Section: Training and Testing Details (mentioning)
confidence: 99%
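
A brief sketch of the two steps quoted above, under assumed toy shapes (T=16, H/16=W/16=14, k=512; none of these are taken from the paper): max pooling the visual feature tensor into a k-vector, and building a mix-n training sample by combining solo waveforms. Averaging the mixture to avoid clipping is one common convention; the quoted work may simply sum.

```python
# Sketch of the two steps described above, with hypothetical shapes.
import torch

# (1) Reduce a T x (H/16) x (W/16) x k visual feature tensor to a k-vector
#     by max pooling over the first three dimensions.
feats = torch.randn(16, 14, 14, 512)    # T=16, H/16=W/16=14, k=512 (toy)
visual_vec = feats.amax(dim=(0, 1, 2))  # shape: (512,)

# (2) Mix-and-Separate: build a mix-n sample by combining the waveforms of
#     n solo videos; the solos serve as ground-truth separation targets.
def make_mix_n(waveforms: list[torch.Tensor]) -> tuple[torch.Tensor, list[torch.Tensor]]:
    n = len(waveforms)
    mixture = torch.stack(waveforms).sum(dim=0) / n  # average to avoid clipping
    return mixture, waveforms                        # model input and targets

solos = [torch.randn(16000) for _ in range(2)]  # two toy 1-second clips
mixture, targets = make_mix_n(solos)
```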
“…A related topic is generating speech by measuring vibrations in a video [14]. Follow-up works include separating an input audio signal into a set of components corresponding to different objects in the given video [20], and separating the audio corresponding to each pixel [46].…”
Section: Related Work (mentioning)
confidence: 99%
“…Xu et al (2017) employ AudioSet for weakly supervised audio event detection, whereas Jansen et al (2017) extract semantic representations from non-speech audio following an unsupervised approach. Since the dataset is intended for general-purpose audio event classification, it is suitable for a variety of audio-related problems, such as music or video processing (Gao et al, 2018; Zhou et al, 2018).…”
Section: AudioSet (mentioning)
confidence: 99%