Sequence-to-Sequence Emotional Voice Conversion With Strength Control

Choi, Heejin; Hahn, Minsoo

doi:10.1109/access.2021.3065460

Cited by 23 publications (32 citation statements)

References 54 publications (59 reference statements)


“…Studies have also revealed that the emotions can be expressed through universal principles that are shared across different individuals and cultures (Ekman, 1992; Manokara et al., 2021). This motivates the study of multispeaker (Shankar et al., 2019b, 2020), and speaker-independent emotional voice conversion (Zhou et al., 2020b; Choi and Hahn, 2021).…”

Section: Related Work Speech Emotion Conversion (mentioning)

confidence: 99%

Textless Speech Emotion Conversion using Discrete and Decomposed Representations

Kreuk, Polyak, Copet et al. 2021

Preprint


Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.


“…There are only few studies on sequence-to-sequence emotional voice conversion [20], [42], [43], [59]. In [42], the authors jointly model pitch and duration with parallel data, where the model is conditioned on the syllable position in the phrase.…”

Section: Sequence-to-sequence Emotional Voice Conversion (mentioning)

confidence: 99%

“…One uses auxiliary features such as a state of voiced, unvoiced, and silence (VUS) [17], attention weights or a saliency map [18]. Another manipulates the internal emotion representations through interpolation [19] or scaling [20]. Despite these methods, emotion intensity control is still an under-explored topic in emotional voice conversion.…”

Section: Introduction (mentioning)

confidence: 99%

Emotion Intensity and its Control for Emotional Voice Conversion

Zhou, Sisman, Rana et al. 2022

Preprint


Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.


“…Such framework generally works well in speaker-dependent tasks. Studies have also revealed that the emotions can be expressed through some universal principles that are shared across different individuals and cultures [55,56,57], that motivates the study of multispeaker [58,59,54], and speaker-independent emotional voice conversion [60,61].…”

Section: Introduction (mentioning)

confidence: 99%

Emotional Voice Conversion: Theory, Databases and ESD

Zhou, Şişman, Li et al. 2021

Preprint


In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies. As case studies, we implement several state-of-the-art emotional voice conversion systems on the ESD database. This paper provides a reference study on ESD in conjunction with its release.
