2021
DOI: 10.48550/arxiv.2104.10832
Preprint

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Cited by 1 publication (2 citation statements)
References 22 publications
“…According to the experiment results in [15], although the pre-trained content encoder can provide precise linguistic information, it still contains speaker information. Thus we apply the speaker information remover as shown in figure 1 to remove the speaker information, and ideally, we can obtain purified content information.…”
Section: Supervision on Intermediate Representation
confidence: 99%
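The citation statement above describes a "speaker information remover" that strips residual speaker information from features produced by a pre-trained content encoder. The citing papers use a learned remover (the exact architecture is not given here); as a purely hypothetical, minimal analogue, one can project a content feature onto the subspace orthogonal to a speaker embedding, so the result carries no component along the speaker direction:

```python
# Hypothetical sketch only: a linear stand-in for the learned "speaker
# information remover" described in the citation statement. We subtract
# from the content feature its projection onto the speaker embedding,
# leaving a vector with no component along the speaker direction.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_speaker_component(content, speaker):
    """Return `content` minus its projection onto unit-normalised `speaker`."""
    norm = dot(speaker, speaker) ** 0.5
    unit = [s / norm for s in speaker]
    coeff = dot(content, unit)
    return [c - coeff * u for c, u in zip(content, unit)]

content = [2.0, 1.0, 0.5]   # toy "content" feature
speaker = [1.0, 0.0, 0.0]   # toy speaker embedding direction
purified = remove_speaker_component(content, speaker)
# the purified feature is orthogonal to the speaker direction
assert abs(dot(purified, speaker)) < 1e-9
```

In practice the remover is trained (for example adversarially) rather than being a fixed linear projection; the sketch only illustrates the goal of making the content representation carry no speaker-correlated component.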
“…Regarding the linguistic content, a pre-trained acoustic model is applied to extract the linguistic feature. Experiments in [15] show that there is speaker information persisting in the linguistic information extracted by the acoustic recognition model. Therefore, we manage to eliminate the residual speaker information and get purified content information.…”
Section: Introduction
confidence: 98%