2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw.2017.61

Improved Speech Reconstruction from Silent Video

Abstract: Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed…
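The abstract describes an end-to-end CNN that regresses acoustic features from silent video frames. As a rough, hypothetical sketch of such a video-to-spectrogram network (PyTorch, the layer sizes, and the mel-spectrogram target are all assumptions for illustration, not the paper's actual architecture):

```python
# Hypothetical sketch of a video-to-speech CNN (not the authors' exact model).
# Assumes PyTorch; layer sizes and the mel-spectrogram output are illustrative.
import torch
import torch.nn as nn

class Video2Speech(nn.Module):
    def __init__(self, n_mels=80, frames_out=4):
        super().__init__()
        # 3D convolutions over (time, height, width) of the mouth-region clip.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 4, 4)),  # collapse time, shrink space
        )
        # Fully connected head regresses a short window of spectrogram frames.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512),
            nn.ReLU(),
            nn.Linear(512, n_mels * frames_out),
        )
        self.n_mels, self.frames_out = n_mels, frames_out

    def forward(self, clip):              # clip: (batch, 3, T, H, W)
        feats = self.encoder(clip)
        out = self.head(feats)
        return out.view(-1, self.n_mels, self.frames_out)

model = Video2Speech()
clip = torch.randn(2, 3, 9, 64, 64)      # 9-frame crops around the mouth
spec = model(clip)                        # (2, 80, 4) predicted mel frames
```

In the paper itself the predicted acoustic features are further converted into a waveform; the sketch stops at the spectrogram frames.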

Cited by 81 publications (81 citation statements) | References 43 publications (69 reference statements)
“…In [25] a deep neural network is developed to generate speech from silent video frames of a speaking person. This model is used in [26] for speech enhancement, where the predicted spectrogram serves as a mask to filter the noisy speech.…”
Section: arXiv:1804.04121v2 [cs.CV] 19 Jun 2018
confidence: 99%
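The citing work [26] uses the predicted clean spectrogram as a mask over the noisy mixture. A minimal sketch of that masking step (the soft ratio-mask formula and the SciPy STFT usage are assumptions for illustration, not the exact procedure of [26]):

```python
# Minimal sketch of spectrogram masking for speech enhancement.
# Assumption: video-predicted magnitudes act as a soft ratio mask;
# this is illustrative, not the exact method of [26].
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, predicted_mag, fs=16000, nperseg=512):
    """noisy: 1-D waveform; predicted_mag: magnitude spectrogram
    aligned with (and shaped like) the STFT of `noisy`."""
    f, t, Zxx = stft(noisy, fs=fs, nperseg=nperseg)
    noisy_mag = np.abs(Zxx)
    # Soft mask in [0, 1]: how much of each time-frequency bin to keep.
    mask = np.clip(predicted_mag / (noisy_mag + 1e-8), 0.0, 1.0)
    # Attenuate the magnitudes, reuse the noisy phase.
    _, enhanced = istft(mask * Zxx, fs=fs, nperseg=nperseg)
    return enhanced
```

Reusing the noisy phase is a common simplification in mask-based enhancement, since only magnitudes are predicted from video.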
“…Several approaches exist for generation of intelligible speech from silent video frames of a person speaking [5,6,7]. In this work we rely on vid2speech [6], briefly described in Sec. 2.1.…”
Section: Visually-Derived Speech Generation
confidence: 99%
“…We continue with the isolation of the speech of a single visible speaker from background sounds. This work builds upon recent advances in machine speechreading, generating speech from visible motion of the face and mouth [5,6,7].…”
Section: Introduction
confidence: 99%
“…The LSPs are converted into waveforms, but since excitation is not predicted, the resulting speech sounds unnatural. This method is extended in [9] by adding optical flow information as input to the network and by adding a postprocessing step in which generated sound features are replaced by their closest match from the training set. A similar method that uses multi-view visual feeds has been proposed in [10].…”
Section: Introduction
confidence: 99%
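The postprocessing step attributed to [9] swaps each generated sound-feature vector for its closest match in the training set, a form of example-based smoothing. A small sketch of that nearest-neighbor lookup (the Euclidean metric and the use of scikit-learn are assumptions, not details from [9]):

```python
# Sketch of example-based postprocessing: replace each generated feature
# vector with its nearest neighbor from the training set, as described in [9].
# Euclidean distance and scikit-learn are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_match(generated, training):
    """generated: (n_frames, dim) predicted features;
    training: (n_examples, dim) features from the training set."""
    nn = NearestNeighbors(n_neighbors=1).fit(training)
    _, idx = nn.kneighbors(generated)
    return training[idx[:, 0]]   # each frame swapped for its best match

train_feats = np.random.randn(1000, 32)      # e.g. spectral feature vectors
gen_feats = np.random.randn(50, 32)
post = closest_match(gen_feats, train_feats)  # (50, 32) cleaned features
```

Because every output frame is drawn from real training examples, the resulting features avoid the over-smoothed quality of raw network predictions.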