Interspeech 2021
DOI: 10.21437/interspeech.2021-307
Emotional Prosody Control for Speech Generation

Cited by 7 publications (8 citation statements)
References: 17 publications
“…Some models feed prosody features with phoneme embeddings directly into the decoder, while others use them to predict intermediate features that are then used to condition the decoder. It is empirically verified (as in Sivaprasad et al., 2021) that intermediate features can be suitably manipulated to bring about the desired change in expression.…”
Section: Introduction (mentioning, confidence: 86%)
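To make the second strategy concrete (predicting intermediate prosody features that condition the decoder, in the spirit of FastSpeech2-style variance adaptors), here is a minimal PyTorch sketch; the module name PitchConditioner, the single pitch feature, and the bin range are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn

class PitchConditioner(nn.Module):
    """Predicts a per-phoneme pitch value from the encoder output, quantizes and
    embeds it, and adds it back, so the decoder is conditioned on an intermediate
    prosody feature that can be rescaled or overridden at inference time."""

    def __init__(self, hidden_dim: int, n_bins: int = 256):
        super().__init__()
        self.pitch_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.pitch_embedding = nn.Embedding(n_bins, hidden_dim)
        # Bin boundaries for normalized pitch; n_bins - 1 boundaries give n_bins buckets.
        self.register_buffer("pitch_bins", torch.linspace(-3.0, 3.0, n_bins - 1))

    def forward(self, phoneme_enc: torch.Tensor, pitch_scale: float = 1.0) -> torch.Tensor:
        # phoneme_enc: (batch, time, hidden_dim) encoder output.
        pitch = self.pitch_predictor(phoneme_enc).squeeze(-1) * pitch_scale  # (batch, time)
        pitch_emb = self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        return phoneme_enc + pitch_emb  # decoder input now carries controllable prosody
```

Raising pitch_scale above 1.0 at inference time is one simple example of manipulating an intermediate feature to change the expression without retraining the decoder.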
“…Furthermore, we observe a performance drop for Fastspeech2π + EVA when compared against Fastspeech2π + DS when both have their backbones trained on the Blizzard dataset (Table 1). The lack of improvement from (Sivaprasad et al., 2021) further highlights that the performance gains of our model do not come from the choice of dataset on which the backbone is trained. Overall, the two experiments conclusively show that the DS module is the decisive component that brings the improvements in naturalness and controllability to the proposed TTS system.…”
Section: Comparing With Prior Art (mentioning, confidence: 99%)
“…We assume that emotional speech from multiple speakers, with emotion labels, is available in the training stage, but only the neutral speech of the target speaker is available in the inference stage. In the context of training, we empirically find that a naive application of existing emotion control methods [9,15] is ineffective, since the emotion feature and speaker identity are highly entangled in the style vector used by the style-based generator (see Figure 3 for qualitative analysis). To overcome this limitation, we use domain adversarial training [16] to disentangle the emotional content from the style vector and make the style-based generator attend solely to the specified emotion condition.…”
Section: Introduction (mentioning, confidence: 99%)
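The domain adversarial training referred to above is commonly implemented with a gradient reversal layer (GRL): an auxiliary classifier predicts the attribute to be removed (here, emotion) from the style vector, while the reversed gradients drive the style encoder to discard it. The sketch below is a minimal PyTorch illustration under that assumption; the names EmotionAdversary, style_encoder, and emotion_ids are hypothetical and not taken from the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class EmotionAdversary(nn.Module):
    """Emotion classifier attached to the style vector through a gradient reversal layer:
    the classifier learns to predict emotion, while the reversed gradients push the
    style encoder to remove emotion information from the style vector."""

    def __init__(self, style_dim: int, num_emotions: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(style_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, style_vec: torch.Tensor) -> torch.Tensor:
        return self.classifier(GradReverse.apply(style_vec, self.lambd))

# Illustrative training step (style_encoder, synthesis_loss, emotion_ids are assumed names):
#   style_vec = style_encoder(reference_mel)                      # (batch, style_dim)
#   adv_loss  = F.cross_entropy(adversary(style_vec), emotion_ids)
#   loss      = synthesis_loss + adv_loss                         # backprop jointly
```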