Self-supervised Audio-visual Co-segmentation

Rouditchenko, Andrew; Zhao, Hang; Gan, Chuang; McDermott, Josh H.; Torralba, Antonio

doi:10.1109/icassp.2019.8682467

Cited by 107 publications

(73 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Humans commonly make subconscious predictions about outcomes in the physical world, and are surprised by the unexpected. Self-supervised learning, in which the goal of learning is to predict the future output from other data streams is a promising direction (34). Imitation learning is also a powerful way learn important behaviors and gain knowledge about the world (35).…”

Section: Origins Of Deep Learning I Have Written a Book The Deep Lementioning

confidence: 99%

The unreasonable effectiveness of deep learning in artificial intelligence

Sejnowski

2020

Proc. Natl. Acad. Sci. U.S.A.

274

141

View full text Add to dashboard Cite

Deep learning networks have been trained to recognize speech, caption photographs and translate text between languages at high levels of performance. Although applications of deep learning networks to real world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and non-convex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated and insights are being found in the geometry of high-dimensional spaces. A mathematical theory of deep learning would illuminate how they function, allow us to assess the strengths and weaknesses of different network architectures and lead to major improvements. Deep learning has provided natural ways for humans to communicate with digital devices and is foundational for building artificial general intelligence. Deep learning was inspired by the architecture of the cerebral cortex and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival, but major breakthroughs will be needed to achieve these goals.

show abstract

Section: Origins Of Deep Learning I Have Written a Book The Deep Lementioning

confidence: 99%

The unreasonable effectiveness of deep learning in artificial intelligence

Sejnowski

2020

Proc. Natl. Acad. Sci. U.S.A.

274

141

View full text Add to dashboard Cite

show abstract

“…Another interesting problem is sounding object localization, where the goal is to associate sounds in the visual input spatially [26,25,3,44,57]. Some other interesting directions include biometric matching [37], sound generation for videos [58], auditory vehicle tracking [16], emotion recognition [1], audio-visual co-segmentation [43], audio-visual navigation [15], and 360/stereo sound from videos [18,35].…”

Section: Related Workmentioning

confidence: 99%

Music Gesture for Visual Sound Separation

Gan¹,

Huang

Zhao

et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

185

132

View full text Add to dashboard Cite

show abstract

“…Follow up works [2,33] further investigated to jointly learn the visual and audio representation using a visual-audio correspondence task. Instead of learning feature representations, recent works have also explored to localize sound source in images or videos [29,26,3,48,64], biometric matching [39], visual-guided sound source separation [64,15,19,60], auditory vehicle tracking [18], multi-modal action recognition [36,35,21], audio inpainting [66], emotion recognition [1], audio-visual event localization [56], multi-modal physical scene understanding [16], audio-visual co-segmentation [47], aerial scene recognition [27] and audio-visual embodied navigation [17].…”

Section: Audio-visual Learningmentioning

confidence: 99%

Foley Music: Learning to Generate Music from Videos

Gan

Huang

Chen

et al. 2020

Lecture Notes in Computer Science

Self Cite

View full text Add to dashboard Cite

In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph−Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI event can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the supplementary video with audio turned on to experience the results.

show abstract

Self-supervised Audio-visual Co-segmentation

Cited by 107 publications

References 27 publications

The unreasonable effectiveness of deep learning in artificial intelligence

The unreasonable effectiveness of deep learning in artificial intelligence

Music Gesture for Visual Sound Separation

Foley Music: Learning to Generate Music from Videos

Contact Info

Product

Resources

About