2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00398
Co-Separating Sounds of Visual Objects

Abstract: Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel train…
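The "artificially mixed" training the abstract contrasts against is the common mix-and-separate recipe: sum two single-source clips, then train a network to predict a spectrogram mask that recovers each source. A minimal NumPy sketch of the mixing step and an ideal-ratio-mask training target (all names, window sizes, and signals here are hypothetical stand-ins, not the paper's actual pipeline):

```python
import numpy as np

def spectrogram(x, win=256, hop=128):
    # Naive magnitude STFT with a Hann window (sizes are placeholders).
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

rng = np.random.default_rng(0)
src_a = rng.standard_normal(4096)   # stand-in for a single-source audio clip
src_b = rng.standard_normal(4096)   # stand-in for another single-source clip
mix = src_a + src_b                 # the artificial "mix-and-separate" mixture

Sa, Sb, Sm = spectrogram(src_a), spectrogram(src_b), spectrogram(mix)
# Ideal ratio mask: the fraction of mixture energy belonging to source A.
# A separation network is trained to predict such a mask from the mixture
# spectrogram (plus, in visually-guided methods, per-object visual features).
mask_a = Sa / (Sa + Sb + 1e-8)
```

The co-separation paradigm the paper proposes avoids requiring such single-source clips at training time, instead learning object-level sounds directly from naturally multi-source videos.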

Cited by 185 publications (238 citation statements)
References 41 publications
“…Ephrat et al [12] and Owens et al [38] proposed to use vision to improve the quality of speech separation. Xu et al [53] and Gao et al [19] further improved the models with recursive models and a co-separation loss. These works all demonstrated how semantic appearance can help with sound separation.…”
Section: Related Work
confidence: 99%
“…
• NMF [52] is a well-established pipeline for audio-only source separation based on matrix factorization;
• Deep Separation [9] is a CNN-based audio-only source separation approach;
• MIML [17] is a model that combines NMF decomposition and multi-instance multi-label learning;
• Sound of Pixels [57] is a pioneering work that uses vision for sound source separation;
• Co-separation [19] devises a new model that incorporates an object-level co-separation loss into the mix-and-separate framework [57];
• Sound of Motions [56] is a recently proposed self-supervised model that leverages trajectory motion cues.
We adopt the blind separation metrics, including signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR), to quantitatively compare the quality of the sound separation.…”
Section: Hetero-Musical Separation
confidence: 99%
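The SDR/SIR metrics cited above are normally computed with a full BSS-Eval implementation (e.g. the `mir_eval.separation` module). As a rough illustration only, a simplified SDR that skips BSS-Eval's optimal scaling and permutation search can be sketched as:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    # Simplified signal-to-distortion ratio in dB: ratio of reference energy
    # to residual error energy. Unlike BSS-Eval, no projection/scaling step.
    noise = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps) /
                         (np.sum(noise ** 2) + eps))

rng = np.random.default_rng(1)
s = rng.standard_normal(1000)                  # stand-in reference source
good = s + 0.01 * rng.standard_normal(1000)    # near-perfect estimate
bad = s + 1.0 * rng.standard_normal(1000)      # heavily distorted estimate
# A better estimate yields a higher SDR.
```

Higher is better for both SDR and SIR; SIR additionally isolates interference from the other sources rather than total distortion.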
“…Gao et al [11] proposed a model that detects each musical instrument in a video clip containing multiple sounds and separates the sound emitted by each instrument. Gan et al [10] improved the performance of time-frequency mask estimation for sound source separation using a context-aware graph network to extract information from the time series of the performer's key points.…”
Section: B. Sound Separation
confidence: 99%
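Time-frequency mask estimation, common to the separation methods above, weights the mixture's spectrogram and resynthesizes audio while reusing the mixture's phase. A minimal NumPy sketch (the window sizes and the constant mask are placeholders for a network's prediction):

```python
import numpy as np

WIN, HOP = 256, 128  # analysis window and hop size (placeholder values)

def stft(x):
    # Naive complex STFT with a Hann window.
    w = np.hanning(WIN)
    return np.stack([np.fft.rfft(x[i:i + WIN] * w)
                     for i in range(0, len(x) - WIN + 1, HOP)])

def istft(frames, length):
    # Overlap-add inverse STFT with window-energy normalization.
    w = np.hanning(WIN)
    out, norm = np.zeros(length), np.zeros(length)
    for k, frame in enumerate(frames):
        i = k * HOP
        out[i:i + WIN] += np.fft.irfft(frame, n=WIN) * w
        norm[i:i + WIN] += w ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
mix = rng.standard_normal(2048)      # stand-in for a sound mixture
X = stft(mix)
mask = np.full(X.shape, 0.5)         # stand-in for a predicted TF mask
est = istft(X * mask, len(mix))      # masked mixture, mixture phase kept
# Away from the window edges, est equals 0.5 * mix exactly.
```

Real systems predict `mask` per source (ratio or binary) from the mixture spectrogram and conditioning features such as detected objects or performer key points.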
“…It is natural to leverage audio-visual synchronization as free supervision for training a neural network to recognize the objects that make sounds. Some works employ the learned self-supervised features to separate sound mixtures, including musical instrument sound separation [19,20,21,22] and speech separation [23], without requiring object/face/lip detection. Although these approaches show great promise that self-supervised learning helps the model focus on the visual regions of sound sources, they may still fail to provide an acceptable solution in complicated auditory scenarios, e.g., when the number of sounding objects is uncertain or even dynamic.…”
Section: Introduction
confidence: 99%