Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Zhang, Jingxuan; Ling, Zhen-Hua; Dai, Li-Rong

doi:10.21437/interspeech.2020-36

Cited by 3 publications

(2 citation statements)

References 27 publications

“…Emovox has a recognition-synthesis structure similar to that of [56], [119]. The Seq2Seq recognition encoder consists of an encoder which is a 2-layer 256-cell BLSTM, and a decoder which is a 1-layer 512-cell LSTM with an attention layer followed by an FC layer with an output channel of 512.…”

Section: Recognition-synthesis Structure (mentioning)

confidence: 99%

Emotion Intensity and its Control for Emotional Voice Conversion

Zhou, Şişman, Rana, et al. 2023

IEEE Trans. Affective Comput.


Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.

“…Our proposed framework can be regarded as a sequence-level recognition-synthesis structure similar to that of [102], [111]. Both the linguistic encoder and the decoder have a sequence-to-sequence encoder-decoder structure.…”

Section: Network Configuration (mentioning)

confidence: 99%

Speech Synthesis With Mixed Emotions

Zhou, Şişman, Rana, et al. 2023

IEEE Trans. Affective Comput.


Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During the training, the framework does not only explicitly characterize emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To our best knowledge, this research is the first study on modelling, synthesizing and evaluating mixed emotions in speech.
