ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683204
Adversarially Trained Autoencoders for Parallel-data-free Voice Conversion

Cited by 7 publications (5 citation statements)
References 8 publications
“…Saito et al. [33] proposed to use PPGs for improving VAE-based VC. Several studies proposed AE-based VC with adversarial learning of hidden representations against speaker information [36], [39], [40]. Polyak et al. [39] incorporated an attention module between the encoder and the decoder in a WaveNet-based AE.…”
Section: B. Auto-Encoder Based Voice Conversion (mentioning)
Confidence: 99%
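As a rough illustration of the adversarial scheme this statement describes, the sketch below (PyTorch; the module sizes, the speaker-embedding decoder conditioning, and the plain negative cross-entropy adversarial loss are assumptions, not the cited papers' exact setups) trains a speaker classifier on the encoder's content code while the autoencoder learns to reconstruct the input and fool that classifier, so speaker identity re-enters only through the decoder's explicit speaker input.

```python
# Sketch: autoencoder whose latent "content" code is trained adversarially
# against a speaker classifier. All dimensions and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, CODE_DIM, N_SPEAKERS = 80, 64, 10  # assumed sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, CODE_DIM))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # speaker identity is injected only here, via an embedding
        self.spk_emb = nn.Embedding(N_SPEAKERS, 16)
        self.net = nn.Sequential(nn.Linear(CODE_DIM + 16, 256), nn.ReLU(),
                                 nn.Linear(256, FEAT_DIM))
    def forward(self, z, spk_id):
        return self.net(torch.cat([z, self.spk_emb(spk_id)], dim=-1))

class SpeakerClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(CODE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, N_SPEAKERS))
    def forward(self, z):
        return self.net(z)

enc, dec, clf = Encoder(), Decoder(), SpeakerClassifier()
opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-4)

def train_step(x, spk_id, adv_weight=0.1):
    # 1) classifier step: learn to recognise the speaker from the content code
    z = enc(x).detach()
    loss_clf = F.cross_entropy(clf(z), spk_id)
    opt_clf.zero_grad(); loss_clf.backward(); opt_clf.step()

    # 2) autoencoder step: reconstruct well while fooling the classifier
    z = enc(x)
    recon = dec(z, spk_id)
    loss_rec = F.l1_loss(recon, x)
    loss_adv = -F.cross_entropy(clf(z), spk_id)  # one simple adversarial objective
    loss = loss_rec + adv_weight * loss_adv
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()
    return loss_rec.item(), loss_clf.item()

# toy usage with random "frames"; at conversion time a different spk_id
# would be fed to the decoder while the content code is reused
x = torch.randn(32, FEAT_DIM)
spk = torch.randint(0, N_SPEAKERS, (32,))
print(train_step(x, spk))
```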
“…To strengthen the adversarial training, a secondary speaker classifier C_s is also applied to the outputs of the first LSTM layer in R. It is likewise trained with a classification loss L_s and passes back an adversarial loss L_adv. As indicated by Ocal et al. [21], the error rate of the optimal speaker classifier relates to an upper bound on the mutual information I(y; H). In order to approximate the optimal classifier, the speaker classifiers are updated K times for each training step in our experiments.…”
Section: Recognition Process (mentioning)
Confidence: 97%
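A minimal sketch of the K-step classifier update this statement describes, assuming an LSTM stand-in for the first layer of R and hypothetical sizes; it is not the cited system, only an illustration of approximating the optimal speaker classifier (whose error rate bounds I(y; H)) by updating C_s several times per step before passing the adversarial loss L_adv back to R.

```python
# Sketch: update the speaker classifier K times per training step on frozen
# hidden states, then give the recognizer the adversarial loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

HID, N_SPK, K = 128, 10, 5                       # assumed hidden size, speakers, inner steps

recognizer = nn.LSTM(80, HID, batch_first=True)  # stand-in for R's first LSTM layer
speaker_clf = nn.Linear(HID, N_SPK)              # stand-in for C_s
opt_r = torch.optim.Adam(recognizer.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(speaker_clf.parameters(), lr=1e-4)

def train_step(feats, spk_id, adv_weight=0.1):
    # inner loop: K classifier updates on detached hidden states (L_s)
    h, _ = recognizer(feats)
    h_mean = h.mean(dim=1).detach()
    for _ in range(K):
        loss_s = F.cross_entropy(speaker_clf(h_mean), spk_id)
        opt_c.zero_grad(); loss_s.backward(); opt_c.step()

    # outer step: the recognizer receives the adversarial loss (L_adv);
    # in the full model this would be added to the main recognition loss
    h, _ = recognizer(feats)
    loss_adv = -F.cross_entropy(speaker_clf(h.mean(dim=1)), spk_id)
    opt_r.zero_grad(); (adv_weight * loss_adv).backward(); opt_r.step()
    return loss_s.item(), loss_adv.item()

feats = torch.randn(8, 100, 80)                  # (batch, frames, features)
spk = torch.randint(0, N_SPK, (8,))
print(train_step(feats, spk))
```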
“…Our method is similar to the auto-encoder (AE) based VC with speaker adversarial learning [19][20][21][22]. Polyak et al. [19] proposed a WaveNet based AE model for VC with a speaker confusion network.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Vector quantization based methods [14] have further been proposed to model content information as discrete distributions, which are more closely related to the distribution of phonetic information. An auxiliary adversarial speaker classifier is adopted [15] to encourage the encoder to cast away speaker information from the content information by minimizing the mutual information between their representations [16].…”
Section: Introduction (mentioning)
Confidence: 99%
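For the vector-quantization idea mentioned in this statement, here is a minimal sketch (PyTorch; the codebook size, dimensions, and commitment weight are assumptions, not the cited method's values) of snapping continuous content vectors to their nearest codebook entries with a straight-through estimator, the usual way such discrete content representations are learned.

```python
# Sketch: vector-quantise content embeddings so the content representation
# becomes discrete; gradients pass through via the straight-through estimator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=64, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                        # z: (batch, frames, code_dim)
        flat = z.reshape(-1, z.size(-1))
        # squared distance to every codebook vector, pick the nearest
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # codebook loss pulls codes toward encoder outputs; commitment loss the reverse
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                 # straight-through estimator
        return q, idx.view(z.shape[:-1]), loss

vq = VectorQuantizer()
z = torch.randn(4, 100, 64)                      # hypothetical encoder output
q, codes, vq_loss = vq(z)
print(q.shape, codes.shape, vq_loss.item())
```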