ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053255

Emotional Voice Conversion Using Multitask Learning with Text-To-Speech

Abstract: Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving the linguistic content. The previous state of the art in VC is based on the sequence-to-sequence (seq2seq) model, which can lose linguistic information. An attempt to overcome this used textual supervision, but it requires explicit alignment, which forfeits the benefit of the seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2…

Cited by 22 publications (24 citation statements) | References 20 publications

“…There are only a few studies on sequence-to-sequence emotional voice conversion [20], [42], [43], [59]. In [42], the authors jointly model pitch and duration with parallel data, where the model is conditioned on the syllable position in the phrase.…”
Section: Sequence-to-Sequence Emotional Voice Conversion (mentioning)
confidence: 99%
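To make the conditioning scheme in that statement concrete, here is a minimal PyTorch sketch of a prosody model that jointly predicts pitch and duration from phone features plus a syllable-position embedding. It is an illustration only: the module names, dimensions, and the bucketed position index are hypothetical, not taken from [42].

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, phone_dim=256, pos_buckets=32, pos_dim=16, hidden=128):
        super().__init__()
        # Embedding of the syllable's (bucketed) position within the phrase.
        self.pos_emb = nn.Embedding(pos_buckets, pos_dim)
        self.rnn = nn.GRU(phone_dim + pos_dim, hidden, batch_first=True)
        self.pitch_head = nn.Linear(hidden, 1)     # per-step log-F0
        self.duration_head = nn.Linear(hidden, 1)  # per-step log-duration

    def forward(self, phone_feats, syllable_pos):
        # phone_feats: (B, T, phone_dim); syllable_pos: (B, T) integer indices
        x = torch.cat([phone_feats, self.pos_emb(syllable_pos)], dim=-1)
        h, _ = self.rnn(x)
        # Pitch and duration share the encoder, i.e. they are jointly modelled.
        return self.pitch_head(h).squeeze(-1), self.duration_head(h).squeeze(-1)
```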
“…Inspired by the success in speaker voice conversion, multi-task learning between emotional voice conversion and text-to-speech has been studied [52]. In this framework, a single sequence-to-sequence model is trained to optimize both VC and TTS, in which the VC system benefits from the latent phonetic representation learnt by TTS during training.…”
Section: Leveraging TTS or ASR Systems (mentioning)
confidence: 99%
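The statement above is the clearest description of the paper's mechanism: a shared decoder ties the VC and TTS objectives to one linguistic embedding space. Below is a minimal PyTorch sketch of that multitask setup under strong simplifying assumptions (frame-synchronous decoding instead of attention-based seq2seq); every name, shape, and the loss weighting is hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultitaskVCTTS(nn.Module):
    def __init__(self, n_mels=80, vocab=64, dim=256):
        super().__init__()
        self.speech_enc = nn.GRU(n_mels, dim, batch_first=True)  # VC input path
        self.text_enc = nn.Embedding(vocab, dim)                 # TTS input path
        self.decoder = nn.GRU(dim, dim, batch_first=True)        # shared decoder
        self.mel_out = nn.Linear(dim, n_mels)

    def forward(self, src_mel=None, text=None):
        # Encode whichever input this batch provides into the shared space.
        if src_mel is not None:
            z, _ = self.speech_enc(src_mel)   # (B, T, dim) from source speech
        else:
            z = self.text_enc(text)           # (B, T, dim) from characters
        h, _ = self.decoder(z)                # same decoder serves both tasks
        return self.mel_out(h)                # predicted mel-spectrogram

def joint_loss(model, vc_batch, tts_batch, alpha=1.0):
    # One optimisation step covers both tasks; gradients from each loss flow
    # through the shared decoder, coupling the two embedding spaces.
    vc_pred = model(src_mel=vc_batch["src_mel"])
    tts_pred = model(text=tts_batch["text"])
    l1 = nn.functional.l1_loss
    return (l1(vc_pred, vc_batch["tgt_mel"])
            + alpha * l1(tts_pred, tts_batch["tgt_mel"]))
```

The only design point the sketch is meant to show is that both losses backpropagate through the same decoder, which is what lets the VC path inherit the phonetic structure learnt from text.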
“…Recent studies on deep learning have shown remarkable performance, with methods such as DNNs [16,39,40], highway neural networks [41], deep bi-directional long short-term memory networks (DBLSTM) [42], and sequence-to-sequence models [43,44]. Beyond parallel training data, new techniques have been proposed to learn the translation between emotional domains with CycleGAN [45,46] and StarGAN [47], to disentangle the emotional elements from speech with auto-encoders [48,49,50,51], and to leverage text-to-speech (TTS) [52,53] or automatic speech recognition (ASR) [54]. Such frameworks generally work well in speaker-dependent tasks.…”
Section: Introduction (mentioning)
confidence: 99%
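For the CycleGAN-based emotion-domain translation mentioned above [45,46], the core objective is cycle consistency: two generators map between emotion domains, and each round trip should reconstruct the input. A minimal sketch follows; the generators and the weight lam are assumed, not drawn from those papers.

```python
import torch.nn as nn

def cycle_consistency_loss(G_ab, G_ba, mel_a, mel_b, lam=10.0):
    # G_ab / G_ba: generators mapping emotion domain A -> B and B -> A
    # (any nn.Module; their internals are out of scope for this sketch).
    l1 = nn.functional.l1_loss
    # Each round trip should reconstruct the input mel-spectrogram.
    loss_a = l1(G_ba(G_ab(mel_a)), mel_a)  # A -> B -> A
    loss_b = l1(G_ab(G_ba(mel_b)), mel_b)  # B -> A -> B
    return lam * (loss_a + loss_b)
```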
“…Seeing the drawbacks of the two methods discussed above, a third method, concatenative speech synthesis, was proposed in [3,13]; it overcomes the naturalness limitation by concatenating pre-recorded units of human speech. HMMs are another way to synthesize speech, as in [13], which uses a hybrid HMM-based method. Apart from these, methods such as LSTMs [1], CNNs [2,3,14], RNNs [1], and Bi-LSTMs [15] have also been used for synthesizing speech.…”
Section: Speech Synthesis From Text (mentioning)
confidence: 99%