2019
DOI: 10.48550/arxiv.1906.06301
Preprint

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Cited by 10 publications (27 citation statements: 1 supporting, 26 mentioning, 0 contrasting)
References 0 publications
“…2) Speaker-independent Result: for speaker-independent cases, we follow the same setups for GRID [34] and TCD-TIMIT [32].…”
Section: Results (mentioning, confidence: 99%)
“…Kumar et al. [33] validate the effectiveness of using multiple views of faces for both speaker-dependent and speaker-independent speech reconstruction. Vougioukas et al. [34] use generative adversarial networks (GANs) to predict raw waveforms directly from visual inputs in an end-to-end fashion, without generating an intermediate audio representation. Inspired by the speech-synthesis model Tacotron2 [35], Qu et al. propose to map video inputs directly to a low-level speech representation, the mel-spectrogram, with an encoder-decoder architecture, and achieve better results in lip-reading experiments.…”
Section: A. Lip to Speech Reconstruction (mentioning, confidence: 99%)
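
To make the end-to-end waveform approach attributed to Vougioukas et al. [34] concrete, here is a minimal PyTorch sketch of a video-to-waveform generator. It is an illustrative assumption, not the published architecture: the class name, layer sizes, the 96x96 mouth crop, and the 640-samples-per-frame ratio (16 kHz audio at 25 fps) are all hypothetical, and the adversarial training loop with a waveform discriminator is omitted.

# Hypothetical sketch of an end-to-end video-to-waveform GAN generator in the
# spirit of Vougioukas et al. [34]; names and sizes are illustrative only.
import torch
import torch.nn as nn

class VideoToWaveformGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D-conv encoder turns a (B, 3, T, 96, 96) mouth-crop clip into
        # one feature vector per video frame.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        # Recurrent layer models temporal context across frames.
        self.rnn = nn.GRU(64, 128, batch_first=True)
        # Transposed 1D convolutions upsample frame-rate features to audio
        # rate: 8 * 8 * 10 = 640 samples per frame (16 kHz at 25 fps).
        self.upsampler = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=20, stride=10, padding=5),
            nn.Tanh(),  # raw waveform samples in [-1, 1]
        )

    def forward(self, frames):            # frames: (B, 3, T, 96, 96)
        feats = self.encoder(frames)      # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        feats, _ = self.rnn(feats)        # (B, T, 128)
        return self.upsampler(feats.transpose(1, 2))  # (B, 1, 640 * T)

waveform = VideoToWaveformGenerator()(torch.randn(2, 3, 25, 96, 96))
print(waveform.shape)  # torch.Size([2, 1, 16000]): one second of 16 kHz audio

The key property the excerpt describes is visible in the forward pass: video frames map straight to waveform samples, with no intermediate spectrogram or vocoder stage.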
“…While this can be effective for the GRID corpus, which has no head movements, optical flow could be a detrimental feature in unconstrained settings due to large head-pose changes. Another work [36] strives for improved speech quality by generating raw waveforms with GANs. However, neither of these works makes use of the well-studied sequence-to-sequence paradigm [31] that underpins text-to-speech generation [30], leaving considerable room for improvement in speech quality and correctness.…”
Section: Lip to Speech Generation (mentioning, confidence: 99%)
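
The excerpt's caveat about optical flow is easy to see in code. Below is a minimal OpenCV sketch (the frame filenames are hypothetical) that computes Farneback dense flow between two consecutive face frames: any global head motion lands in the flow field on top of the lip motion, which is why flow features that work on the static-head GRID corpus degrade in unconstrained settings.

# Minimal sketch of a dense optical-flow feature; frame paths are hypothetical.
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Farneback dense flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# With a static head (as in GRID), the flow is dominated by lip motion.
# With head movement, a global component swamps the articulatory signal;
# the mean flow over the whole frame estimates that global motion.
global_motion = flow.reshape(-1, 2).mean(axis=0)
print("mean flow (dx, dy):", global_motion)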
“…Prior works in lip-to-speech regard their speech representation either as a 2D image [10, 36], in the case of mel-spectrograms, or as a single feature vector [10], in the case of LPC features. In both cases, they use a 2D-CNN to decode these speech representations.…”
Section: Problem Formulation (mentioning, confidence: 99%)
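
To illustrate the 2D-CNN decoding strategy the excerpt describes, here is a minimal PyTorch sketch, with all layer sizes assumed for illustration: the mel-spectrogram is treated as a one-channel "image" whose two spatial axes are mel bins and time, and transposed 2D convolutions expand a latent feature map into it.

# Hypothetical sketch of decoding a mel-spectrogram as a 2D image;
# channel counts and the latent shape are illustrative assumptions.
import torch
import torch.nn as nn

decoder = nn.Sequential(
    # Input: (B, 256, 5, T/8) latent feature map from some video encoder.
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # (B, 128, 10, T/4)
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # (B, 64, 20, T/2)
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),    # (B, 32, 40, T)
    nn.ReLU(),
    # Final layer doubles only the mel axis, to 80 bins.
    nn.ConvTranspose2d(32, 1, kernel_size=(4, 3), stride=(2, 1), padding=1),  # (B, 1, 80, T)
)

latent = torch.randn(2, 256, 5, 32)  # batch of 2, 32 latent time steps
mel = decoder(latent)
print(mel.shape)  # torch.Size([2, 1, 80, 256]): 80 mel bins x 256 frames

Both spectrogram axes are upsampled by the same convolutional machinery here, which is exactly the image-like treatment the excerpt contrasts with sequence-to-sequence decoding.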