Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

Hsu, Chin-Cheng; Hwang, Hsin-Te; Wu, Yi-Chiao; Tsao, Yu; Wang, Hsin‐Min

doi:10.48550/arxiv.1704.00849

Cited by 89 publications

(60 citation statements)

References 15 publications

“…VC performance is critically dependent on the availability of the target speaker's voice data for training [11][12][13][14][15][16][17]. Hence, the challenge of one-shot VC is in performing conversion across arbitrary speakers that may be unseen during training, and with only one single target-speaker utterance for reference.…”

Section: Related Work (mentioning)

confidence: 99%

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Wang, Deng, Yeung et al. 2021 (Preprint)

One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.

“…Auto-encoder [116] is one of the techniques which is commonly used for speech disentanglement and reconstruction. Studies have shown the effectiveness of autoencoder [24] and its variants [25,128,129] in disentangling the speaker information from the content, thus they are widely used in speaker voice conversion [130] and singing voice conversion [131].…”

Section: Disentanglement Between Emotional Prosody and Linguistic Con... (mentioning)

confidence: 99%

“…In this way, we learn the frame-level content-related representations from the speech in an unsupervised manner [50,60,48]. VAE-GAN [50] and VAW-GAN [25,60] are successful attempts. Another idea is for an autoencoder to learn an emotion-invariant latent code and an emotion-related style code in latent space [49].…”

Section: Disentanglement Between Emotional Prosody and Linguistic Con... (mentioning)

confidence: 99%

“…For effective modeling, parallel training data are required in general. Recently, voice conversion techniques with non-parallel training data have been studied, for example, domain translation [21,22], multitask learning [23] and speaker disentanglement [24,25,26] are among the successful attempts. The recent progress in voice conversion becomes the source of inspiration for emotional voice conversion studies.…”

Section: Introduction (mentioning)

confidence: 99%

Emotional Voice Conversion: Theory, Databases and ESD

Zhou, Şişman, Li et al. 2021 (Preprint)

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies. As case studies, we implement several state-of-the-art emotional voice conversion systems on the ESD database. This paper provides a reference study on ESD in conjunction with its release.

“…Voice conversion has taken some major strides in terms of speech quality and speaker similarity. Various approaches have been proposed, such as Gaussian mixture model (GMM) [3,4,5], frequency warping approaches [6,7,8], exemplar based methods [9,10,11], and neural network based methods [12,13,14,15,16,17,18,19,20,21]. Recently, disentangling speaker and linguistic content representations based on deep learning for voice conversion [22,23,24,25] has received much attention.…”

Section: Introduction (mentioning)

confidence: 99%

Noise-robust voice conversion with domain adversarial training

Du, Xie, Li 2022 (Preprint)

Voice conversion has made great progress in the past few years under the studio-quality test scenario in terms of speech quality and speaker similarity. However, in real applications, test speech from source speaker or target speaker can be corrupted by various environment noises, which seriously degrade the speech quality and speaker similarity. In this paper, we propose a novel encoder-decoder based noise-robust voice conversion framework, which consists of a speaker encoder, a content encoder, a decoder, and two domain adversarial neural networks. Specifically, we integrate disentangling speaker and content representation technique with domain adversarial training technique. Domain adversarial training makes speaker representations and content representations extracted by speaker encoder and content encoder from clean speech and noisy speech in the same space, respectively. In this way, the learned speaker and content representations are noise-invariant. Therefore, the two noise-invariant representations can be taken as input by the decoder to predict the clean converted spectrum. The experimental results demonstrate that our proposed method can synthesize clean converted speech under noisy test scenarios, where the source speech and target speech can be corrupted by seen or unseen noise types during the training process. Additionally, both speech quality and speaker similarity are improved.
