This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with those of the target speaker. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of the training data are introduced to provide references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further remove speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, consisting of a pre-training stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, our proposed method imposes no constraints of frame alignment or frame-by-frame conversion. Experimental results showed that our method achieved higher similarity and naturalness than the best non-parallel voice conversion method in Voice Conversion Challenge 2018. Moreover, the performance of our proposed method was close to that of the state-of-the-art parallel seq2seq voice conversion method.
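The two disentanglement strategies described above can be illustrated with a minimal PyTorch sketch: a recognition encoder carries a phoneme-prediction head (strategy one) and an adversarial speaker classifier trained through gradient reversal (strategy two). All module names, layer sizes, and the gradient-reversal formulation are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the disentanglement idea (assumed dimensions and names).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so minimizing the speaker classifier's loss removes speaker cues upstream."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class RecognitionEncoder(nn.Module):
    """Maps acoustic frames to linguistic representations (hypothetical sizes)."""
    def __init__(self, n_mels=80, hidden=256, n_phones=64, n_speakers=8):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.phone_head = nn.Linear(2 * hidden, n_phones)        # phoneme supervision
        self.adv_speaker_head = nn.Linear(2 * hidden, n_speakers)  # adversarial branch

    def forward(self, mels):
        h, _ = self.rnn(mels)                          # (B, T, 2*hidden)
        phone_logits = self.phone_head(h)              # strategy 1: predict phonemes
        spk_logits = self.adv_speaker_head(GradReverse.apply(h.mean(dim=1)))
        return h, phone_logits, spk_logits             # strategy 2: adversarial training

# Dummy usage: 4 utterances, 100 frames, 80-dim mel spectrograms.
enc = RecognitionEncoder()
linguistic, phone_logits, spk_logits = enc(torch.randn(4, 100, 80))
print(linguistic.shape, phone_logits.shape, spk_logits.shape)
```

In training, the phoneme head would be optimized to predict the transcriptions while the gradient-reversed speaker head pushes the shared states toward speaker invariance; pooling over time before the speaker classifier is one of several plausible choices.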
This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method proposed in our previous work achieved higher naturalness and similarity. In this paper, we further improve its performance by utilizing the text transcriptions of parallel training data. First, a multi-task learning structure is designed which adds auxiliary classifiers to the middle layers of the seq2seq model and predicts linguistic labels as a secondary task. Second, a data-augmentation method is proposed which utilizes text alignment to produce extra parallel sequences for model training. Experiments are conducted to evaluate our proposed methods with training sets of different sizes. Experimental results show that multi-task learning with linguistic labels is effective at reducing the errors of seq2seq voice conversion, and that the data-augmentation method can further improve performance when only 50 or 100 training utterances are available.
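As a rough illustration of the multi-task learning structure, the sketch below combines a primary spectrogram-regression loss with an auxiliary linguistic-label classification loss computed from middle-layer states. The loss weight, dimensions, and helper names are hypothetical, not the paper's reported settings.

```python
# Minimal sketch of the multi-task loss (assumed weights and dimensions).
import torch
import torch.nn as nn

def multitask_loss(pred_mels, target_mels, hidden, linguistic_labels,
                   aux_classifier, aux_weight=0.1):
    """Primary seq2seq regression loss plus an auxiliary linguistic-label task."""
    recon = nn.functional.l1_loss(pred_mels, target_mels)      # main conversion task
    logits = aux_classifier(hidden)                            # (B, T, n_labels)
    aux = nn.functional.cross_entropy(
        logits.transpose(1, 2), linguistic_labels)             # secondary task
    return recon + aux_weight * aux

# Dummy usage: batch of 4, 100 frames, 80-dim mels, 40 linguistic label classes.
aux_clf = nn.Linear(256, 40)
loss = multitask_loss(torch.randn(4, 100, 80), torch.randn(4, 100, 80),
                      torch.randn(4, 100, 256), torch.randint(0, 40, (4, 100)),
                      aux_clf)
loss.backward()
```

The auxiliary term supplies an extra gradient signal that encourages the middle layers to encode linguistic content, which is the stated motivation for the multi-task design.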