ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053436

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space…
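
The abstract's key distinction, between drawing each token's latent code independently under the standard prior and drawing it from a sequential prior, can be illustrated with a toy simulation. The sketch below is not the paper's model: the codebook size, token count, and the near-diagonal transition distribution standing in for a learned autoregressive prior are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 64  # assumed codebook size (discrete latent classes per token)
T = 12  # assumed number of input tokens (e.g., phonemes)

# Standard (independent) prior: each token's code is drawn i.i.d.,
# so adjacent tokens can jump to arbitrary codebook entries.
iid_codes = rng.integers(0, K, size=T)

def toy_ar_transition(prev, kernel_width=3.0):
    """Toy stand-in for an autoregressive prior: the distribution over
    the next code is concentrated near the previous one."""
    logits = -((np.arange(K) - prev) ** 2) / (2 * kernel_width ** 2)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Sequential prior: each code is sampled conditioned on its predecessor.
ar_codes = [int(rng.integers(0, K))]
for _ in range(T - 1):
    ar_codes.append(int(rng.choice(K, p=toy_ar_transition(ar_codes[-1]))))

print("i.i.d. prior codes:", iid_codes.tolist())
print("AR prior codes:   ", ar_codes)
# The AR samples drift smoothly between neighboring codebook indices,
# mirroring the paper's motivation: a sequential prior avoids the
# dramatic token-to-token prosodic jumps the standard prior produces.
```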

Cited by 73 publications (75 citation statements) · References 16 publications

“…VQ-VAE [14] has been applied to various speech synthesis tasks, including diverse and controllable TTS [17,18], a new TTS framework based on symbol-to-symbol translation [19], speech coding [20], voice conversion [21], and representation learning [22,23,24,25].…”
Section: Vector Quantized Autoencoder For Speech Tasks
confidence: 99%
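
Since several citing snippets revolve around VQ-VAE [14], a minimal sketch of its quantization step may help ground the terminology: each continuous encoder output is snapped to its nearest codebook entry, yielding the discrete per-token codes that a prior (autoregressive, in this paper) can then model. The codebook size and latent dimensionality below are arbitrary assumptions, and the straight-through gradient and commitment loss used in training are omitted.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each continuous latent to its nearest codebook entry (L2)."""
    # z_e: (T, D) encoder outputs; codebook: (K, D) learned entries
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    indices = dists.argmin(axis=1)                                   # (T,)
    return codebook[indices], indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))  # K=64 entries, D=8 dims (assumed)
z_e = rng.normal(size=(12, 8))       # one continuous latent per token
z_q, idx = vector_quantize(z_e, codebook)
print(idx)  # discrete code per token: the space a sequential prior models
```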
“…Among the related methods, Sun et al. applied conditional VQ-VAE to TTS, although their objective was diverse TTS for data augmentation rather than duration modeling [18]. Their method relies on soft attention to align speech and phonemes.…”
Section: Vector Quantized Autoencoder For Speech Tasks
confidence: 99%
“…Another idea [33]–[37] is to extract latent prosody embeddings to characterize prosody. Some [33]–[35] learn speech variations without explicit annotations for prosody or style.…”
Section: Tacotron-based TTS
confidence: 99%
“…The learned prosody embeddings are usually not fully controllable and interpretable. Others [36], [37] just take the prosody embeddings as an auxiliary input to the TTS model.…”
Section: Tacotron-based TTS
confidence: 99%
“…These systems are trained with well-articulated read speech. Attempts to model speech with various speaking styles have also been actively investigated in deep-learning-based speech synthesis studies [3]–[10].…”
Section: Introduction
confidence: 99%