Interspeech 2021
DOI: 10.21437/interspeech.2021-307
Emotional Prosody Control for Speech Generation

Cited by 7 publications (8 citation statements)
References: 17 publications
“…Some models feed prosody features with phoneme embeddings directly into the decoder, while others use them to predict intermediate features that are then used to condition the decoder. It is empirically verified (as in Sivaprasad et al., 2021) that intermediate features can be suitably manipulated to bring about the desired change in expression.…”
Section: Introduction (mentioning, confidence: 86%)
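To make the second strategy concrete (predicting intermediate prosody features that condition the decoder, in the spirit of FastSpeech2-style variance adaptors), here is a minimal PyTorch sketch; the module name PitchConditioner, the single pitch feature, and the bin range are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn

class PitchConditioner(nn.Module):
    """Predicts a per-phoneme pitch value from the encoder output, quantizes and
    embeds it, and adds it back, so the decoder is conditioned on an intermediate
    prosody feature that can be rescaled or overridden at inference time."""

    def __init__(self, hidden_dim: int, n_bins: int = 256):
        super().__init__()
        self.pitch_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.pitch_embedding = nn.Embedding(n_bins, hidden_dim)
        # Bin boundaries for normalized pitch; n_bins - 1 boundaries give n_bins buckets.
        self.register_buffer("pitch_bins", torch.linspace(-3.0, 3.0, n_bins - 1))

    def forward(self, phoneme_enc: torch.Tensor, pitch_scale: float = 1.0) -> torch.Tensor:
        # phoneme_enc: (batch, time, hidden_dim) encoder output.
        pitch = self.pitch_predictor(phoneme_enc).squeeze(-1) * pitch_scale  # (batch, time)
        pitch_emb = self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        return phoneme_enc + pitch_emb  # decoder input now carries controllable prosody
```

Raising pitch_scale above 1.0 at inference time is one simple example of manipulating an intermediate feature to change the expression without retraining the decoder.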
“…Furthermore, we observe a performance drop for Fastspeech2π + EVA when compared against Fastspeech2π + DS when both have their backbones trained on the Blizzard dataset (Table 1). The lack of improvement from (Sivaprasad et al., 2021) further highlights that the performance gains of our model do not come from the choice of dataset on which the backbone is trained. Overall, the two experiments conclusively show that the DS module is the decisive component that brings the improvements in naturalness and controllability to the proposed TTS system.…”
Section: Comparing With Prior Art (mentioning, confidence: 99%)
“…We assume that emotional speech from multiple speakers, with emotion labels, is available in the training stage, but only the neutral speech of the target speaker is available in the inference stage. In the context of training, we empirically find that a naive application of existing emotion control methods [9,15] is ineffective, since the emotion feature and speaker identity are highly entangled in the style vector used by the style-based generator (see Figure 3 for qualitative analysis). To overcome this limitation, we use domain adversarial training [16] to disentangle the emotional content from the style vector and make the style-based generator attend solely to the specified emotion condition.…”
Section: Introduction (mentioning, confidence: 99%)
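The domain adversarial training referred to above is commonly implemented with a gradient reversal layer (GRL): an auxiliary classifier predicts the attribute to be removed (here, emotion) from the style vector, while the reversed gradients drive the style encoder to discard it. The sketch below is a minimal PyTorch illustration under that assumption; the names EmotionAdversary, style_encoder, and emotion_ids are hypothetical and not taken from the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class EmotionAdversary(nn.Module):
    """Emotion classifier attached to the style vector through a gradient reversal layer:
    the classifier learns to predict emotion, while the reversed gradients push the
    style encoder to remove emotion information from the style vector."""

    def __init__(self, style_dim: int, num_emotions: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(style_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, style_vec: torch.Tensor) -> torch.Tensor:
        return self.classifier(GradReverse.apply(style_vec, self.lambd))

# Illustrative training step (style_encoder, synthesis_loss, emotion_ids are assumed names):
#   style_vec = style_encoder(reference_mel)                      # (batch, style_dim)
#   adv_loss  = F.cross_entropy(adversary(style_vec), emotion_ids)
#   loss      = synthesis_loss + adv_loss                         # backprop jointly
```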