Interspeech 2019
DOI: 10.21437/interspeech.2019-2571

Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-To-Speech

Cited by 60 publications (34 citation statements); References 0 publications.
“…2. The acoustic encoder in the CTC model consists of convolution layers followed by a bidirectional LSTM, which is similar to that in [28]. The LatentNet φ in the prior consists of a CBHG-based linguistic encoder [29] and an additional LSTM layer.…”
Section: Methods
confidence: 99%
“…A semi-supervised approach utilizing both Mel spectrograms and prosodic features as inputs to a variational framework is proposed in [16]. In an approach similar to ours, [17] uses aggregated continuous prosodic features (F0, mgc0, duration) for fine-grained prosody transfer. We differentiate our work by introducing discrete representations for arbitrary prosody control, as well as a method for disentanglement of phonetic and prosodic content.…”
Section: Related Work
confidence: 99%
“…Though sentence-level latent representations can capture prosodic features from speech [9,16], they lack fine-grained controllability and robustness in inter-speaker transfer. Recently, a line of research has turned to fine-grained prosody modeling [17][18][19][20]. [17] introduced temporal structures on both the speech and text sides for prosody embedding, which enabled pitch and amplitude manipulation at the frame and phoneme levels.…”
Section: Fine-grained Prosody Modeling
confidence: 99%
“…[17] introduced temporal structures on both the speech and text sides for prosody embedding, which enabled pitch and amplitude manipulation at the frame and phoneme levels. [18] pre-computed prosody-related acoustic features of phonemes, and used them to reproduce a reference prosody on synthesized speech. [19] and [20] used fine-grained VAEs to learn local latent features under different model specifications.…”
Section: Fine-grained Prosody Modeling
confidence: 99%