Emotional Voice Conversion Using Dual Supervised Adversarial Networks With Continuous Wavelet Transform F0 Features

Luo, Zhaojie; Chen, Jinhui; Takiguchi, Tetsuya; Ariki, Yasuo

doi:10.1109/taslp.2019.2923951

Cited by 26 publications

(21 citation statements)

References 31 publications

“…In general, both categorical and dimensional representations of emotions have been widely used in both emotion recognition [85] and emotional voice conversion [86,45,87,83]. The study on representation learning [70] represents a new way of emotion representation, that further calls for large scale emotion-annotated speech data.…”

Section: Emotion Representation (mentioning)

confidence: 99%

Emotional Voice Conversion: Theory, Databases and ESD

Zhou, Şişman, Li, et al. 2021

Preprint

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies. As case studies, we implement several state-of-the-art emotional voice conversion systems on the ESD database. This paper provides a reference study on ESD in conjunction with its release.

“…Meanwhile, emotional voice conversion mainly has done with frame-based conversion [11,12] or rule-based approach [13]. These have limitations since DTW does not ensure the exact alignment and rule-based approach has a limitation to model voice conversion.…”

Section: Related Work (mentioning)

confidence: 99%

Emotional Voice Conversion Using Multitask Learning with Text-To-Speech

Kim, Cho, Choi, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Voice conversion (VC) is a task to transform a person's voice to different style while conserving linguistic contents. Previous state-of-the-art on VC is based on sequence-to-sequence (seq2seq) model, which could mislead linguistic information. There was an attempt to overcome it by using textual supervision, it requires explicit alignment which loses the benefit of using seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS has abundant information on the text. The role of the decoder of TTS is to convert embedding space to speech, which is same to VC. In the proposed model, the whole network is trained to minimize loss of VC and TTS. VC is expected to capture more linguistic information and to preserve training stability by multitask learning. Experiments of VC were performed on a male Korean emotional text-speech dataset, and it is shown that multitask learning is helpful to keep linguistic contents in VC.

“…As prosody plays an important role in expressing emotional speech, several studies have focused on modelling spectral and fundamental frequency (F0) features with parallel data. Some previous works have explored prosody and spectral mapping separately using GMM [14]-[16], FNN [17], deep belief network (DBN) [18], and GAN [19] methods. Ming et al [20] converted the spectrum and F0 simultaneously with bidirectional long-short term memory (LSTM) using parallel data.…”

Section: Introduction (mentioning)

confidence: 99%

Sequence-to-Sequence Emotional Voice Conversion With Strength Control

Choi, Hahn 2021

IEEE Access

This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. EVC methods without duration mapping generate emotional speech with identical duration to that of the neutral input speech. In reality, even the same sentences would have different speeds and rhythms depending on the emotions. To solve this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn attention in the neutral input sequence should be focused on which part of the emotional output sequence. Besides, to capture the multi-attribute aspects of emotional variations, an emotion encoder is designed for transforming acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. By introducing a speaker encoder, the proposed method can preserve speaker identity even after the emotion conversion. Objective and subjective evaluation results confirm that the proposed method is superior to other previous works. Especially, in emotion strength control, we achieve in getting successful results.
