Flow-Based Unconstrained Lip to Speech Generation

He, Jinzheng; Zhao, Zhou; Ren, Yi; Liu, Jinglin; Huai, Baoxing; Yuan, Nicholas Jing

doi:10.1609/aaai.v36i1.19966

Cited by 10 publications

(13 citation statements)

References 21 publications

“…Griffin-Lim algorithm [11]). These works outperform previous works by a wide margin under unconstrained settings [15,24].…”

Section: Introduction (mentioning)

confidence: 79%

“…As this topic has only recently attracted attention of the researchers, there are not many works on it currently. Prajwal et al [24] firstly propose an autoregressive sequence-to-sequence model modified from Tacotron 2 [31] to tackle this problem, which generates mel-spectrograms conditioned on video frames; He et al [15] use a non-autoregressive architecture to accelerate inference and use a Glow [19] module for mel-spectrogram refinement.…”

Section: Unconstrained Lip-to-speech Synthesis (mentioning)

confidence: 99%

“…Current works [15,24] show that the sequence-to-sequence [4,32] architecture is an effective solution to this problem. These works combine a visual encoder consisting of a stack of 3D-CNNs [35] and an LSTM [28] with acoustic models from TTS models [25,26,31], either adopting an autoregressive architecture [24] or a flow-based non-autoregressive architecture [15]. They use a two-stage pipeline that first generates mel-spectrograms as intermediate representations, and then synthesizes audio waveforms from the spectrograms with a signal-processing-based algorithm (i.e.…”

Section: Introduction (mentioning)

confidence: 99%

“…In addition, both the autoregressive model [24] and the flow-based non-autoregressive model [15] suffer from inefficiencies in either inference time or memory usage due to their model characteristics. The autoregressive model suffers from high inference latency due to its recursive nature [12,13,26], while the flow-based model has a large amount of parameters which causes high memory occupancy (see Section 5.6).…”

Section: Introduction (mentioning)

confidence: 99%

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Wang, Zhao (2022), Preprint, Self Cite

Unconstrained lip-to-speech synthesis aims to generate corresponding speeches from silent videos of talking faces with no restriction on head poses or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audios, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audios from the spectrograms. This causes cumbersome deployment and degradation of speech quality due to error propagation; 2) The audio reconstruction algorithm used by these models limits the inference speed and audio quality, while neural vocoders are not available for these models since their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither of them is efficient enough in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency, and has a relatively small model size. Besides, different from the widely used 3D-CNN visual frontend for lip movement encoding, we for the first time propose a transformer-based visual frontend for this task. Experiments show that our model achieves 19.76× speedup for audio waveform generation compared with the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.

CCS Concepts: Computing methodologies → Natural language generation; Activity recognition and understanding.
