Binaural audio provides a listener with a 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarcely available and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while absent from the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this link from unlabeled video. We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations. We call the resulting output 2.5D visual sound: the visual stream helps "lift" the flat single-channel audio into spatialized sound. In addition to sound generation, we show that the self-supervised representation learned by our network benefits audio-visual source separation. Our video results: http://vision.cs.utexas.edu/projects/2.5D_visual_sound/
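To make the mono-to-binaural idea concrete, here is a minimal PyTorch sketch, not the paper's exact architecture: a small encoder-decoder predicts a mask over the mono spectrogram that estimates the left-right difference signal, conditioned on a visual feature vector tiled at the bottleneck. The class name `MonoToBinauralNet` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class MonoToBinauralNet(nn.Module):
    """Hypothetical sketch: predict a mask for the left-right difference
    spectrogram from a mono spectrogram plus a visual feature vector."""
    def __init__(self, visual_dim=512):
        super().__init__()
        # audio encoder over the mono (magnitude) spectrogram
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # decoder conditioned on visual features tiled at the bottleneck
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64 + visual_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, mono_spec, visual_feat):
        a = self.enc(mono_spec)                    # (B, 64, F/4, T/4)
        v = visual_feat[:, :, None, None].expand(-1, -1, a.shape[2], a.shape[3])
        mask = self.dec(torch.cat([a, v], dim=1))  # mask for the L-R difference
        diff = mask * mono_spec                    # predicted difference spectrogram
        # assuming mono = left + right, recover the two channels:
        left = (mono_spec + diff) / 2
        right = (mono_spec - diff) / 2
        return left, right

# toy usage: batch of 2 spectrograms (256 freq bins, 64 frames) + visual features
net = MonoToBinauralNet()
left, right = net(torch.randn(2, 1, 256, 64), torch.randn(2, 512))
```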
Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results:
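As an illustration of the multi-instance multi-label step, the sketch below treats each audio frequency basis (e.g., from an NMF-style decomposition of the spectrogram) as one instance in a bag and max-pools per-class scores, so training needs only video-level object labels from the visual stream. `MIMLHead` and all dimensions are hypothetical, a simplification of the paper's framework.

```python
import torch
import torch.nn as nn

class MIMLHead(nn.Module):
    """Hypothetical sketch of multi-instance multi-label learning:
    each audio basis vector (an instance) gets per-class scores, and
    max-pooling over the bag yields video-level label predictions."""
    def __init__(self, basis_dim=256, num_classes=15):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(basis_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, bases):                  # bases: (B, M, basis_dim)
        scores = self.instance_scorer(bases)   # (B, M, num_classes)
        bag_logits, which = scores.max(dim=1)  # pool over the M instances
        return bag_logits, which               # `which` links bases to labels

# video-level labels can come from an off-the-shelf image classifier, so no
# audio annotation is needed; multi-label BCE over the bag logits suffices:
# loss = nn.functional.binary_cross_entropy_with_logits(bag_logits, labels)
```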
Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel training objective requires that the deep neural network's separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
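A hedged sketch of the co-separation idea, assuming magnitude spectrograms and an L1 loss (the function and argument names below are hypothetical): the per-object masks separated from the two-video mixture must sum back to each source video's ground-truth spectrogram ratio mask.

```python
import torch
import torch.nn.functional as F

def coseparation_loss(pred_masks_v1, pred_masks_v2, mix_spec, spec_v1, spec_v2):
    """Hypothetical sketch of the co-separation objective: per-object
    predicted masks for each video should sum to that video's
    ground-truth ratio mask over the mixed spectrogram."""
    eps = 1e-8
    # ground-truth ratio masks of each video's track w.r.t. the mixture
    gt_mask_v1 = spec_v1 / (mix_spec + eps)
    gt_mask_v2 = spec_v2 / (mix_spec + eps)
    # per-object predicted masks (B, num_objects, F, T), summed per video
    sum_v1 = pred_masks_v1.sum(dim=1)
    sum_v2 = pred_masks_v2.sum(dim=1)
    return F.l1_loss(sum_v1, gt_mask_v1) + F.l1_loss(sum_v2, gt_mask_v2)
```

In the full objective this reconstruction term would be paired with the object-consistency requirement from the abstract, i.e., a classifier loss that makes the separated audio for similar-looking objects identifiable.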
Existing methods to recognize actions in static images take the images at face value, learning the appearances (objects, scenes, and body poses) that distinguish each action class. However, such models are deprived of the rich dynamic structure and motions that also define human activity. We propose an approach that hallucinates the unobserved future motion implied by a single snapshot to help static-image action recognition. The key idea is to learn a prior over short-term dynamics from thousands of unlabeled videos, infer the anticipated optical flow on novel static images, and then train discriminative models that exploit both streams of information. Our main contributions are twofold. First, we devise an encoder-decoder convolutional neural network and a novel optical flow encoding that can translate a static image into an accurate flow map. Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition. On seven datasets, we demonstrate the strength of the approach: it not only achieves state-of-the-art accuracy for dense optical flow prediction, but also consistently enhances recognition of actions and dynamic scenes.
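For illustration, a minimal PyTorch encoder-decoder that regresses a dense two-channel (u, v) flow map from a single RGB frame. The paper additionally introduces a specialized optical flow encoding, which this sketch omits; `FlowHallucinator` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class FlowHallucinator(nn.Module):
    """Hypothetical sketch: encoder-decoder that regresses a dense
    2-channel optical-flow map from a single RGB frame."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # (u, v) flow
        )

    def forward(self, image):            # image: (B, 3, H, W)
        return self.decoder(self.encoder(image))

# training: regress against flow computed between frames of unlabeled video,
#   loss = nn.functional.mse_loss(model(frame_t), flow_t_to_t1)
# at test time, the hallucinated flow feeds the motion stream of a
# standard two-stream action-recognition network.
```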
While machine learning approaches to image restoration offer great promise, current methods risk training models fixated on performing well only for image corruption at a particular level of difficulty, such as a certain level of noise or blur. First, we examine the weakness of conventional "fixated" models and demonstrate that training general models to handle arbitrary levels of corruption is indeed non-trivial. Then, we propose an on-demand learning algorithm for training image restoration models with deep convolutional neural networks. The main idea is to exploit a feedback mechanism to self-generate training instances where they are needed most, thereby learning models that can generalize across difficulty levels. On four restoration tasks (image inpainting, pixel interpolation, image deblurring, and image denoising) and three diverse datasets, our approach consistently outperforms both the status quo training procedure and curriculum learning alternatives.
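The feedback loop can be sketched in a few lines of Python (hypothetical names, simplified from the paper): after each evaluation pass, sample the next training instance's corruption level in proportion to the current per-difficulty-bin loss, so the model sees more examples where it currently performs worst.

```python
import random

def sample_difficulty(bin_losses):
    """Hypothetical sketch of on-demand sampling: pick a corruption-level
    bin with probability proportional to its current validation loss, so
    bins the model handles poorly receive more training examples."""
    total = sum(bin_losses)
    weights = [loss / total for loss in bin_losses]
    return random.choices(range(len(bin_losses)), weights=weights, k=1)[0]

# toy usage: four difficulty bins for, e.g., increasing noise level;
# per-bin losses would be re-measured on a validation set each epoch
bin_losses = [0.12, 0.30, 0.45, 0.60]
bin_idx = sample_difficulty(bin_losses)
# generate the next training instance with corruption drawn from bin_idx
```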