2020
DOI: 10.48550/arxiv.2004.02541
Preprint

Vocoder-Based Speech Synthesis from Silent Videos

citations
Cited by 8 publications
(14 citation statements)
references
References 0 publications
“…Yadav et al [11] used stochastic modelling approach with variational autoencoder. Michelsanti et al [12] predicted vocoder features of [13] and synthesized speech using the vocoder. Different from the previous works, our approach explicitly models the local visual feature and global visual context to synthesize accurate speech.…”
Section: Related Work (mentioning)
confidence: 99%
“…Afterwards, Prajwal et al [36] improve the model performance with 3D CNN and skip connections. Recently, Michelsanti et al [37] have presented a multi-task architecture to learn spectral envelope, aperiodic parameters and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [38] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder.…”
Section: A Lip To Speech Reconstruction (mentioning)
confidence: 99%
“…The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [25], while the latter sometimes disregards this aspect for naturalness [26]. Given this definition, a voice cloning can be a TTS, a VC, or any type of speech synthesis system [4], [5]. The NAUTILUS system is designed to be expandable to other input interfaces.…”
Section: Related Work On Voice Cloning a Definition Of Voice Cloning (mentioning)
confidence: 99%
“…In this work, we treat our system as a whole, instead of focusing on individual techniques, and we compare it with other third-party systems. For objective evaluation, we used an ASR model 4 to calculate the word error rate (WER) of generated speech. Note that the WER was only used as a reference point since it is highly sensitive to the training data of the ASR model.…”
Section: Evaluation Measurements (mentioning)
confidence: 99%