ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682467
Self-supervised Audio-visual Co-segmentation

Abstract: Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [1]. Here, we introduce a learning approach to disentangle concepts in the neural networks, and a…
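
The abstract only sketches the approach, so the following is a minimal, hypothetical PyTorch sketch of the pixel-to-sound, mix-and-separate setup that the work in [1] is based on; the layer sizes, module names, and max-pooling choice are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a visual net yields per-pixel
# features, an audio net yields K spectrogram feature maps, and their inner
# product gives a per-pixel separation mask. Training is "mix-and-separate"
# self-supervision: mix the audio of two videos and recover each track.
# All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 16  # number of shared audio/visual feature channels (assumed)

class VisualNet(nn.Module):
    """Maps an image to a K-channel per-pixel feature map."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, K, 1),
        )
    def forward(self, frames):            # (B, 3, H, W)
        return self.backbone(frames)      # (B, K, H', W')

class AudioNet(nn.Module):
    """Maps a mixture spectrogram to K feature maps of the same size."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, K, 1),
        )
    def forward(self, spec):              # (B, 1, F, T)
        return self.net(spec)             # (B, K, F, T)

def pixelwise_masks(vis_feat, aud_feat):
    """Per-pixel separation masks: inner product of visual and audio features."""
    B, Kc, H, W = vis_feat.shape
    _, _, Fr, T = aud_feat.shape
    v = vis_feat.view(B, Kc, H * W)                  # (B, K, P)
    a = aud_feat.view(B, Kc, Fr * T)                 # (B, K, F*T)
    masks = torch.einsum('bkp,bkq->bpq', v, a)       # (B, P, F*T)
    return torch.sigmoid(masks).view(B, H * W, Fr, T)

def train_step(vnet, anet, frames1, spec1, frames2, spec2, opt):
    """One mix-and-separate step: mixture in, each video's own spectrogram as target."""
    mix = spec1 + spec2
    a = anet(mix)
    loss = 0.0
    for frames, target in ((frames1, spec1), (frames2, spec2)):
        masks = pixelwise_masks(vnet(frames), a)     # (B, P, F, T)
        # Pool over pixels to get that video's predicted mask (assumption;
        # other pooling choices are possible).
        pred = masks.max(dim=1).values.unsqueeze(1) * mix
        loss = loss + F.l1_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

Because the per-pixel masks are produced before pooling, the same trained model can be queried at individual pixels, which is what links sound separation to visual segmentation in this line of work.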

Cited by 107 publications (73 citation statements). References 27 publications.
“…Humans commonly make subconscious predictions about outcomes in the physical world, and are surprised by the unexpected. Self-supervised learning, in which the goal of learning is to predict the future output from other data streams, is a promising direction (34). Imitation learning is also a powerful way to learn important behaviors and gain knowledge about the world (35).…”
Section: Origins Of Deep Learning I Have Written a Book The Deep Le…
confidence: 99%
“…Another interesting problem is sounding object localization, where the goal is to associate sounds in the visual input spatially [26,25,3,44,57]. Some other interesting directions include biometric matching [37], sound generation for videos [58], auditory vehicle tracking [16], emotion recognition [1], audio-visual co-segmentation [43], audio-visual navigation [15], and 360/stereo sound from videos [18,35].…”
Section: Related Work
confidence: 99%
“…Follow-up works [2,33] further investigated jointly learning visual and audio representations using a visual-audio correspondence task. Instead of learning feature representations, recent works have also explored localizing sound sources in images or videos [29,26,3,48,64], biometric matching [39], visually guided sound source separation [64,15,19,60], auditory vehicle tracking [18], multi-modal action recognition [36,35,21], audio inpainting [66], emotion recognition [1], audio-visual event localization [56], multi-modal physical scene understanding [16], audio-visual co-segmentation [47], aerial scene recognition [27], and audio-visual embodied navigation [17].…”
Section: Audio-visual Learning
confidence: 99%