2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462527

Seeing Through Noise: Visually Driven Speaker Separation And Enhancement

Abstract: Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This ap…
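The pipeline the abstract describes — predict the speaker's speech from the silent video, then apply that prediction as a filter on the noisy audio — can be sketched as a soft spectrogram mask. This is a minimal sketch, not the authors' exact formulation: the function name, the ratio-mask form, and the clipping are illustrative assumptions, and the predicted magnitude spectrogram is assumed to come from some video-to-speech model.

```python
import numpy as np

def enhance_with_predicted_spectrogram(noisy_stft, predicted_mag, eps=1e-8):
    """Filter a noisy mixture using a magnitude spectrogram predicted from video.

    noisy_stft:    complex STFT of the noisy input audio, shape (freq, time).
    predicted_mag: magnitude spectrogram predicted by a (hypothetical)
                   video-to-speech model, same shape.
    Returns the enhanced complex STFT; the noisy phase is reused as-is.
    """
    noisy_mag = np.abs(noisy_stft)
    # Soft mask: ratio of predicted speech magnitude to mixture magnitude,
    # clipped to [0, 1] so the filter only attenuates, never amplifies.
    mask = np.clip(predicted_mag / (noisy_mag + eps), 0.0, 1.0)
    return mask * noisy_stft
```

Inverting the masked STFT (e.g. with an inverse short-time Fourier transform) would then yield the enhanced waveform.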

Cited by 79 publications
(75 citation statements)
References 23 publications
“…Concurrent work Concurrently and independently from us, a number of groups have proposed closely related methods for source separation and sound localization. Gabbay et al [44,45] use a vision-to-sound method to separate speech, and propose a convolutional separation model. Unlike our work, they assume speaker identities are known.…”
Section: Audio-Visual Source Separation
confidence: 99%
“…In [25] a deep neural network is developed to generate speech from silent video frames of a speaking person. This model is used in [26] for speech enhancement, where the predicted spectrogram serves as a mask to filter the noisy speech. However, the noisy audio signal is not used in the pipeline, and the network is not trained for the task of speech enhancement.…”
Section: arXiv:1804.04121v2 [cs.CV] 19 Jun 2018
confidence: 99%
“…[15] used it for speech denoising; Chung et al. [9] demonstrated lip reading from face videos. Ephrat et al. [12] and Owens et al. [34] demonstrated speech separation and enhancement from videos.…”
confidence: 99%