Vocoder-Based Speech Synthesis from Silent Videos

Michelsanti, Daniel; Slizovskaia, Olga; Haro, Gloria; Jensen, Jesper

doi:10.48550/arxiv.2004.02541

Cited by 8 publications

(14 citation statements)

References 0 publications


“…Yadav et al [11] used stochastic modelling approach with variational autoencoder. Michelsanti et al [12] predicted vocoder features of [13] and synthesized speech using the vocoder. Different from the previous works, our approach explicitly models the local visual feature and global visual context to synthesize accurate speech.…”

Section: Related Work (mentioning)

confidence: 99%

Lip to Speech Synthesis with Visual Context Attentional GAN

Kim, Hong, Ro

2022

Preprint


In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophene. To achieve this, a visual context attention module is proposed where it encodes global representations from the local visual features, and provides the desired global visual context corresponding to the given coarse speech representation to the generator through audio-visual attention. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize a speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms existing state-of-the-art and is able to effectively synthesize the speech from multi-speaker that has been barely handled in the previous works.


“…Afterwards, Prajwal et al [36] improve the model performance with 3D CNN and skip connections. Recently, Michelsanti et al [37] have presented a multi-task architecture to learn spectral envelope, aperiodic parameters and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [38] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder.…”

Section: A Lip To Speech Reconstruction (mentioning)

confidence: 99%

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Qu, Weber, Wermter

2021

Preprint


The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2 which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is firstly pre-trained on ∼2400h multi-lingual (e.g. English and German) audiovisual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement on speech quality and intelligibility compared to previous approaches in speaker-dependent and -independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the impact on transferability. Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets.


“…The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [25], while the latter sometimes disregards this aspect for naturalness [26]. Given this definition, a voice cloning can be a TTS, a VC, or any type of speech synthesis system [4], [5]. The NAUTILUS system is designed to be expandable to other input interfaces.…”

Section: Related Work On Voice Cloning: A Definition Of Voice Cloning (mentioning)

confidence: 99%

“…In this work, we treat our system as a whole, instead of focusing on individual techniques, and we compare it with other third-party systems. For objective evaluation, we used an ASR model to calculate the word error rate (WER) of generated speech. Note that the WER was only used as a reference point since it is highly sensitive to the training data of the ASR model.…”

Section: Evaluation Measurements (mentioning)

confidence: 99%

“…In its narrow sense, speech synthesis is used to refer to text-to-speech (TTS) systems [1], which play an essential role in a spoken dialog system as a way for machine-human communication. In its broader definition, speech synthesis can refer to all kinds of speech generation interfaces like voice conversion (VC) [2], video-to-speech [3], [4], and others [5]. Recent state-of-the-art speech synthesis systems can generate speech with natural sounding quality, some of which is indistinguishable from recorded speech [6].…”

Section: Introduction (mentioning)

confidence: 99%


NAUTILUS: A Versatile Voice Cloning System

Luong, Yamagishi

2020

IEEE/ACM Trans. Audio Speech Lang. Process.


We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.
