ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053255

Emotional Voice Conversion Using Multitask Learning with Text-To-Speech

Abstract: Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving the linguistic content. The previous state of the art in VC is based on the sequence-to-sequence (seq2seq) model, which can lose linguistic information. An attempt to overcome this used textual supervision, but it requires explicit alignment, which forfeits the benefit of the seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2…

Cited by 22 publications (24 citation statements) | References 20 publications

“…There are only a few studies on sequence-to-sequence emotional voice conversion [20], [42], [43], [59]. In [42], the authors jointly model pitch and duration with parallel data, where the model is conditioned on the syllable position in the phrase.…”
Section: Sequence-to-Sequence Emotional Voice Conversion (mentioning)
confidence: 99%
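To make the conditioning scheme in that statement concrete, here is a minimal PyTorch sketch of a prosody model that jointly predicts pitch and duration from phone features plus a syllable-position embedding. It is an illustration only: the module names, dimensions, and the bucketed position index are hypothetical, not taken from [42].

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, phone_dim=256, pos_buckets=32, pos_dim=16, hidden=128):
        super().__init__()
        # Embedding of the syllable's (bucketed) position within the phrase.
        self.pos_emb = nn.Embedding(pos_buckets, pos_dim)
        self.rnn = nn.GRU(phone_dim + pos_dim, hidden, batch_first=True)
        self.pitch_head = nn.Linear(hidden, 1)     # per-step log-F0
        self.duration_head = nn.Linear(hidden, 1)  # per-step log-duration

    def forward(self, phone_feats, syllable_pos):
        # phone_feats: (B, T, phone_dim); syllable_pos: (B, T) integer indices
        x = torch.cat([phone_feats, self.pos_emb(syllable_pos)], dim=-1)
        h, _ = self.rnn(x)
        # Pitch and duration share the encoder, i.e. they are jointly modelled.
        return self.pitch_head(h).squeeze(-1), self.duration_head(h).squeeze(-1)
```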
“…Inspired by the success in speaker voice conversion, multi-task learning between emotional voice conversion and text-to-speech has been studied [52]. In this framework, a single sequence-to-sequence model is trained to optimize both VC and TTS, in which the VC system benefits from the latent phonetic representation learnt by TTS during training.…”
Section: Leveraging TTS or ASR Systems (mentioning)
confidence: 99%
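The statement above is the clearest description of the paper's mechanism: a shared decoder ties the VC and TTS objectives to one linguistic embedding space. Below is a minimal PyTorch sketch of that multitask setup under strong simplifying assumptions (frame-synchronous decoding instead of attention-based seq2seq); every name, shape, and the loss weighting is hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultitaskVCTTS(nn.Module):
    def __init__(self, n_mels=80, vocab=64, dim=256):
        super().__init__()
        self.speech_enc = nn.GRU(n_mels, dim, batch_first=True)  # VC input path
        self.text_enc = nn.Embedding(vocab, dim)                 # TTS input path
        self.decoder = nn.GRU(dim, dim, batch_first=True)        # shared decoder
        self.mel_out = nn.Linear(dim, n_mels)

    def forward(self, src_mel=None, text=None):
        # Encode whichever input this batch provides into the shared space.
        if src_mel is not None:
            z, _ = self.speech_enc(src_mel)   # (B, T, dim) from source speech
        else:
            z = self.text_enc(text)           # (B, T, dim) from characters
        h, _ = self.decoder(z)                # same decoder serves both tasks
        return self.mel_out(h)                # predicted mel-spectrogram

def joint_loss(model, vc_batch, tts_batch, alpha=1.0):
    # One optimisation step covers both tasks; gradients from each loss flow
    # through the shared decoder, coupling the two embedding spaces.
    vc_pred = model(src_mel=vc_batch["src_mel"])
    tts_pred = model(text=tts_batch["text"])
    l1 = nn.functional.l1_loss
    return (l1(vc_pred, vc_batch["tgt_mel"])
            + alpha * l1(tts_pred, tts_batch["tgt_mel"]))
```

The only design point the sketch is meant to show is that both losses backpropagate through the same decoder, which is what lets the VC path inherit the phonetic structure learnt from text.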
“…Recent studies on deep learning have shown remarkable performance, with methods such as DNNs [16,39,40], highway neural networks [41], deep bi-directional long short-term memory networks (DBLSTM) [42], and sequence-to-sequence models [43,44]. Beyond parallel training data, new techniques have been proposed to learn the translation between emotional domains with CycleGAN [45,46] and StarGAN [47], to disentangle the emotional elements from speech with auto-encoders [48,49,50,51], and to leverage text-to-speech (TTS) [52,53] or automatic speech recognition (ASR) [54]. Such frameworks generally work well in speaker-dependent tasks.…”
Section: Introduction (mentioning)
confidence: 99%
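For the CycleGAN-based emotion-domain translation mentioned above [45,46], the core objective is cycle consistency: two generators map between emotion domains, and each round trip should reconstruct the input. A minimal sketch follows; the generators and the weight lam are assumed, not drawn from those papers.

```python
import torch.nn as nn

def cycle_consistency_loss(G_ab, G_ba, mel_a, mel_b, lam=10.0):
    # G_ab / G_ba: generators mapping emotion domain A -> B and B -> A
    # (any nn.Module; their internals are out of scope for this sketch).
    l1 = nn.functional.l1_loss
    # Each round trip should reconstruct the input mel-spectrogram.
    loss_a = l1(G_ba(G_ab(mel_a)), mel_a)  # A -> B -> A
    loss_b = l1(G_ab(G_ba(mel_b)), mel_b)  # B -> A -> B
    return lam * (loss_a + loss_b)
```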
“…Seeing the drawbacks of the two methods discussed above, a third method, concatenative speech synthesis, was proposed in [3,13]; it overcomes the naturalness limitation by concatenating pre-recorded units of human speech. HMMs are another way to synthesize speech, as in [13], which uses a hybrid HMM-based method. Apart from these, methods such as LSTMs [1], CNNs [2,3,14], RNNs [1], and Bi-LSTMs [15] have also been used for synthesizing speech.…”
Section: Speech Synthesis From Text (mentioning)
confidence: 99%