2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462527

Seeing Through Noise: Visually Driven Speaker Separation And Enhancement

Abstract: Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This ap…
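The pipeline the abstract describes — predict the speaker's speech from the silent video, then apply that prediction as a filter on the noisy audio — can be sketched as a soft spectrogram mask. This is a minimal sketch, not the authors' exact formulation: the function name, the ratio-mask form, and the clipping are illustrative assumptions, and the predicted magnitude spectrogram is assumed to come from some video-to-speech model.

```python
import numpy as np

def enhance_with_predicted_spectrogram(noisy_stft, predicted_mag, eps=1e-8):
    """Filter a noisy mixture using a magnitude spectrogram predicted from video.

    noisy_stft:    complex STFT of the noisy input audio, shape (freq, time).
    predicted_mag: magnitude spectrogram predicted by a (hypothetical)
                   video-to-speech model, same shape.
    Returns the enhanced complex STFT; the noisy phase is reused as-is.
    """
    noisy_mag = np.abs(noisy_stft)
    # Soft mask: ratio of predicted speech magnitude to mixture magnitude,
    # clipped to [0, 1] so the filter only attenuates, never amplifies.
    mask = np.clip(predicted_mag / (noisy_mag + eps), 0.0, 1.0)
    return mask * noisy_stft
```

Inverting the masked STFT (e.g. with an inverse short-time Fourier transform) would then yield the enhanced waveform.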

Cited by 79 publications
(75 citation statements)
References 23 publications
“…Concurrent work Concurrently and independently from us, a number of groups have proposed closely related methods for source separation and sound localization. Gabbay et al [44,45] use a vision-to-sound method to separate speech, and propose a convolutional separation model. Unlike our work, they assume speaker identities are known.…”
Section: Audio-Visual Source Separation
confidence: 99%
“…In [25] a deep neural network is developed to generate speech from silent video frames of a speaking person. This model is used in [26] for speech enhancement, where the predicted spectrogram serves as a mask to filter the noisy speech. However, the noisy audio signal is not used in the pipeline, and the network is not trained for the task of speech enhancement.…”
Section: arXiv:1804.04121v2 [cs.CV] 19 Jun 2018
confidence: 99%
“…[15] used it for speech denoising; Chung et al. [9] demonstrated lip reading from face videos. Ephrat et al. [12] and Owens et al. [34] demonstrated speech separation and enhancement from videos.…”
confidence: 99%