Interspeech 2019
DOI: 10.21437/interspeech.2019-1445

Video-Driven Speech Reconstruction Using Generative Adversarial Networks

Abstract: Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our …
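The abstract's end-to-end claim (silent video in, raw waveform out, no intermediate acoustic features) can be pictured with a minimal generator/critic pair. The PyTorch sketch below is an assumption-laden illustration rather than the paper's architecture: the 25 fps / 16 kHz rates, the layer sizes, and the upsampling factors are placeholders chosen so that each video frame maps to 640 waveform samples.

```python
# Minimal sketch of a video-to-waveform GAN: a generator maps mouth-region frames
# directly to a raw waveform, and a critic scores waveform clips.
# All hyperparameters are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Spatio-temporal encoder: one embedding per video frame (time dim preserved).
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool away space
        )
        self.rnn = nn.GRU(64, 128, batch_first=True)  # temporal model across frames
        # Upsample each frame embedding by 8 * 8 * 10 = 640 waveform samples.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=20, stride=10, padding=5),
            nn.Tanh(),  # waveform in [-1, 1]
        )

    def forward(self, video):                          # video: (B, 3, T, H, W)
        h = self.encoder(video)                        # (B, 64, T, 1, 1)
        h = h.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        h, _ = self.rnn(h)                             # (B, T, 128)
        return self.decoder(h.transpose(1, 2))         # (B, 1, T * 640)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=31, stride=4, padding=15),
        )

    def forward(self, wav):                            # wav: (B, 1, N)
        return self.net(wav).mean(dim=(1, 2))          # one realism score per clip
```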

Cited by 37 publications (48 citation statements)
References: 29 publications
“…Therefore, AP and F0 were not estimated from the silent video, but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and an FFNN within a regression-based framework. As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. While the choice of visual features did not have a big impact on the results, the use of mel-filterbank amplitudes made it possible to outperform the systems based on LPC coefficients.

Ref   | Year | Visual features          | Audio features                               | Models              | Region
[149] | 2017 | AAM                      | Codebook entries (mel-filterbank amplitudes) | FFNN / RNN          | mouth
[57]  | 2017 | Raw pixels               | LSP of LPC                                   | CNN, FFNN           | face
[56]  | 2017 | Raw pixels, optical flow | Mel-scale and linear-scale spectrograms      | CNN, FFNN, BiGRU    | face
[11]  | 2018 | Raw pixels               | AE features, spectrogram                     | CNN, LSTM, FFNN, AE | face
[145] | 2018 | Raw pixels               | LSP of LPC                                   | CNN, LSTM, FFNN     | mouth
[147] | 2018 | Raw pixels               | LSP of LPC                                   | CNN, BiGRU, FFNN    | mouth
[146] | 2019 | Raw pixels               | LSP of LPC                                   | CNN, BiGRU, FFNN    | mouth
[243] | 2019 | Raw pixels               | WORLD spectrum                               | CNN, FFNN           | mouth
[256] | 2019 | Raw pixels               | Raw waveform                                 | GAN, CNN, GRU       | mouth
[247] | 2019 | Raw pixels               | AE features, spectrogram                     | CNN, LSTM, FFNN, AE | mouth
[177] | 2020 | Raw pixels               | WORLD features                               | CNN, GRU, FFNN      | mouth / face
[206] | 2020 | Raw pixels               | Mel-scale spectrogram                        | CNN, LSTM           | face
…”
Section: A. Speech Reconstruction From Silent Videos
confidence: 99%
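The regression framework summarised in this statement lends itself to a compact sketch. The Python snippet below is only an illustration under assumed feature sizes (an 8 x 8 block of 2-D DCT coefficients, 23 mel-filterbank amplitudes): per-frame visual features are mapped by an FFNN to the SP representation, while F0 and AP would be generated artificially, exactly as the quote notes.

```python
# Sketch of visual-feature-to-SP regression: 2-D DCT features of a mouth crop are
# mapped frame by frame to mel-filterbank amplitudes by a small feed-forward network.
# Feature dimensions and layer sizes are assumptions for illustration.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dctn

def dct_features(mouth_roi, k=8):
    """2-D DCT of a grayscale mouth crop; keep the k x k low-frequency block."""
    coeffs = dctn(mouth_roi.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].ravel()            # 64-dim visual feature vector

class SPRegressor(nn.Module):
    """FFNN mapping one visual feature vector to one frame of filterbank amplitudes."""
    def __init__(self, in_dim=64, n_mels=23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, x):
        return self.net(x)

# Training would minimise a frame-wise MSE between predicted and reference amplitudes:
# loss = nn.functional.mse_loss(model(visual_feats), target_filterbanks)
```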
“…The method proposed in [177] aimed to still reconstruct speech in a speaker-independent scenario while also avoiding artefacts similar to the ones introduced by the model in [256]. Therefore, vocoder features were used as the training target instead of raw waveforms.…”
Section: A. Speech Reconstruction From Silent Videos
confidence: 99%
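As a concrete illustration of the vocoder-feature route mentioned above, the sketch below extracts WORLD parameters (F0, spectral envelope, aperiodicity) as training targets and resynthesises audio from them; the network would predict these parameters from video, and waveforms would only be synthesised afterwards. Using the pyworld and soundfile packages is an implementation assumption, not a detail taken from [177].

```python
# Sketch: WORLD vocoder features as training targets instead of raw waveforms.
import numpy as np
import pyworld as pw
import soundfile as sf

def world_targets(wav_path):
    """Extract WORLD parameters (F0, SP, AP) from a reference recording."""
    x, fs = sf.read(wav_path)
    if x.ndim > 1:                                    # mix down to mono if needed
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)     # pyworld expects float64
    f0, t = pw.harvest(x, fs)                         # F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)                  # spectral envelope (SP)
    ap = pw.d4c(x, f0, t, fs)                         # aperiodicity (AP)
    return f0, sp, ap, fs

def resynthesise(f0, sp, ap, fs):
    """Turn (possibly predicted) WORLD parameters back into a waveform."""
    return pw.synthesize(f0, sp, ap, fs)
```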
“…The encoded features are then passed to a pre-trained autoencoder network to predict the audio features. Using an adversarial training approach, in [9], the authors proposed using a discriminator network (called a "critic") that distinguishes real audio from the generated audio. In addition, the model uses a pre-trained speech-to-video network [10] as a reference model to minimize the perceptual loss between the features obtained from real and generated audio.…”
Section: Related Work
confidence: 99%
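The adversarial-plus-perceptual objective described in this statement can be summarised as follows. This is a hedged sketch: `generator`, `critic`, and `reference_net` are hypothetical modules standing in for the networks mentioned in the quote, and the Wasserstein-style critic term and the weight `lam` are assumptions rather than the exact losses of [9].

```python
# Sketch of combining a critic (adversarial) loss with a perceptual loss computed by
# a frozen, pre-trained reference network on real versus generated audio.
import torch

def generator_loss(generator, critic, reference_net, video, real_audio, lam=10.0):
    fake_audio = generator(video)

    # Adversarial term: the generator tries to make the critic score fakes highly.
    adv = -critic(fake_audio).mean()

    # Perceptual term: L1 distance between reference-network features of real and
    # generated audio (reference_net is assumed frozen, requires_grad=False).
    with torch.no_grad():
        real_feats = reference_net(real_audio)
    fake_feats = reference_net(fake_audio)
    perceptual = torch.nn.functional.l1_loss(fake_feats, real_feats)

    return adv + lam * perceptual

def critic_loss(critic, generator, video, real_audio):
    with torch.no_grad():
        fake_audio = generator(video)
    # The critic learns to score real audio above generated audio.
    return critic(fake_audio).mean() - critic(real_audio).mean()
```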
“…In addition, sound and images can be converted to each other in [9]. Speech is reconstructed from facial videos in [40]. Talking face generation [23,42,45,49,51] is a typical example of audio-to-video generation.…”
Section: Multi-modal Generation
confidence: 99%