Investigation of F0 Conditioning and Fully Convolutional Networks in Variational Autoencoder Based Voice Conversion

Huang, Wen-Chin; Wu, Yi-Chiao; Lo, Chen-Chou; Tobing, Patrick Lumban; Hayashi, Tomoki; Kobayashi, Kazuhiro; Toda, Tomoki; Tsao, Yu; Wang, Hsin-Min

doi:10.21437/interspeech.2019-1774

Cited by 14 publications

(12 citation statements)

References 28 publications

“…Following [68], the latent space and speaker representation were set to 16-dimensional. We used a mini-batch of 16 and the Adam optimizer with a fixed learning rate of 0.0001.…”

Section: Experimental Evaluations, A. Experimental Settings

mentioning

confidence: 99%

“…In this section, we investigate the degree of disentanglement of the VC models involved in this study. We use a novel metric that was recently proposed in [68] as the disentanglement measurement, termed DEM. The main design concept of DEM is that a pair of sentences of the same content uttered by the source and target speakers should have similar latent codes since the phonetic contents are the same.…”

Section: F. Disentanglement Measure

mentioning

confidence: 99%

Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion

Huang, Luo, Hwang, et al. 2020

IEEE Trans. Emerg. Top. Comput. Intell.

Self Cite

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating the generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.

“…We believe that a means to address this problem is to separate emotional features from emotion-independent features. Autoencoder (AE)- or variational autoencoder (VAE)-based feature disentanglement methods provide a possible solution and have been successfully applied to the voice conversion (VC) task [15,16,17,18,19,20]. In [17], by providing the speaker identity features as a condition to the decoder, the encoder learns to encode only the speaker-independent information in the process of minimising the reconstruction loss.…”

Section: Introduction

mentioning

confidence: 99%

“…Autoencoder (AE)- or variational autoencoder (VAE)-based feature disentanglement methods provide a possible solution and have been successfully applied to the voice conversion (VC) task [15,16,17,18,19,20]. In [17], by providing the speaker identity features as a condition to the decoder, the encoder learns to encode only the speaker-independent information in the process of minimising the reconstruction loss. Experimental results from [20] show that the degree of disentanglement is positively correlated with the performance of the VC model and can be enhanced by both GANs and the speaker classifier.…”

Section: Introduction

mentioning

confidence: 99%

An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

He, Chen, Rizos, et al. 2021

Interspeech 2021

Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2 % in Micro-F1 and 5 % in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.

“…Subjective evaluation results of the FCN-CDVAE based VC method [37] with a waveform generation process by the WORLD vocoder or the proposed method. Here M and F denotes male and female, respectively.…”

mentioning

confidence: 99%

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Huang, Wu, Kobayashi, et al. 2019

10th ISCA Workshop on Speech Synthesis (SSW 10)

Self Cite

We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder, meanwhile preserves rich spectral details hence capable of generating high quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the estimation model to parallel training and the need of extra training on each conversion pair, which make DIFFVC inflexible. Based on the above motivations, we propose a new DIFFVC framework based on an F0 transformation in the residual domain. By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential. We describe several details that need to be taken care of under this modification, and by applying our proposed method to a non-parallel, variational autoencoder (VAE)-based spectral conversion model, we demonstrate that this framework can be generalized to any spectral conversion model, and experimental evaluations show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder.
