Interspeech 2020
DOI: 10.21437/interspeech.2020-1033

Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition

Cited by 12 publications (4 citation statements)
References 19 publications
“…Pretraining a speaker recognition system offers the advantage of using large-scale speaker databases, enabling the learned speaker representation to exhibit high speaker similarity in several multi-speaker speech generation frameworks [19], [43]- [45]. On the other hand, joint training provides a more flexible optimization process dedicated to the speech synthesis task, providing further insights to characterize the speaker details [46], [47].…”
Section: Neural Speaker Encoding
confidence: 99%
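The excerpt above contrasts pretraining a speaker encoder on large verification corpora with training it jointly with the synthesizer. A minimal sketch of the pretrained-encoder route (all names here are hypothetical stand-ins, not the cited systems' actual models): embed a speaker's utterances into a fixed vector and check that embeddings of the same speaker agree more than those of different speakers.

```python
import numpy as np

# Toy stand-in for a pretrained speaker encoder (real systems use e.g.
# d-vector or x-vector networks trained on speaker-verification data).
rng = np.random.default_rng(2)
EMB_DIM = 16

def speaker_encoder(utterances):
    """Average a speaker's frame features into one unit-norm embedding."""
    emb = np.mean(utterances, axis=0)
    return emb / np.linalg.norm(emb)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two batches from the same speaker share a speaker-specific bias, so
# their embeddings should be closer to each other than to another speaker.
bias = rng.normal(size=EMB_DIM)
spk_a1 = speaker_encoder(bias + 0.1 * rng.normal(size=(20, EMB_DIM)))
spk_a2 = speaker_encoder(bias + 0.1 * rng.normal(size=(20, EMB_DIM)))
spk_b = speaker_encoder(rng.normal(size=(20, EMB_DIM)))

assert cosine(spk_a1, spk_a2) > cosine(spk_a1, spk_b)
```

In a multi-speaker synthesis framework, the resulting embedding would simply be concatenated to (or used to modulate) the synthesizer's hidden states, which is what lets the pretrained route reuse large speaker databases without retraining the encoder.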
“…What GAN- and Flow-based models have in common is that they both bypass the problem of feature decoupling and convert speech directly, while some other works [9,10,11,12,13,14] attempt to disentangle the style unit and the content unit in the embedding space. The purpose is obvious: once content information and timbre information are obtained separately, it is easy to fix the content embedding while replacing the style embedding to convert the voice.…”
Section: Introduction
confidence: 99%
“…One type of method is based on the automatic speech recognition (ASR) model [9,10,11,15]. First, a pretrained speaker-independent ASR model is employed to extract linguistic features (e.g.
Section: Introduction
confidence: 99%
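The ASR-based route described above typically extracts frame-wise phone posteriors (often called phonetic posteriorgrams, PPGs), which carry linguistic content while discarding most speaker identity. A hedged sketch with a random weight matrix standing in for a real pretrained acoustic model:

```python
import numpy as np

# Stand-in for a pretrained speaker-independent acoustic model: a single
# linear layer followed by a softmax over phone classes. A real system
# would use a deep ASR network trained on transcribed speech.
rng = np.random.default_rng(1)
N_FRAMES, N_MELS, N_PHONES = 5, 40, 10
W = rng.normal(size=(N_MELS, N_PHONES))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phonetic_posteriorgram(mel_frames):
    """Map each mel frame to a distribution over phone classes."""
    return softmax(mel_frames @ W)

mel = rng.normal(size=(N_FRAMES, N_MELS))  # toy mel-spectrogram
ppg = phonetic_posteriorgram(mel)

assert ppg.shape == (N_FRAMES, N_PHONES)
assert np.allclose(ppg.sum(axis=1), 1.0)  # each frame is a distribution
```

Because the posteriors are normalized per frame, they encode *which phone* is being spoken rather than *who* is speaking, which is why a downstream synthesizer conditioned on them can re-render the content in a different voice.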