2021
DOI: 10.48550/arxiv.2106.10132
Preprint

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations by reducing their inter-dependencies in an unsupervised manner.
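To make the method described in the abstract concrete, here is a minimal PyTorch sketch of the VQ content-encoding step; the codebook size, feature dimension, and straight-through estimator details are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of VQ-based content encoding (illustrative assumptions,
# not the paper's exact setup): each content frame is snapped to its
# nearest codebook entry, discarding fine speaker-dependent detail.
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """z: (T, D) encoder outputs; codebook: (K, D) learnable code vectors."""
    dists = torch.cdist(z, codebook)      # (T, K) pairwise L2 distances
    idx = dists.argmin(dim=1)             # nearest code index per frame
    q = codebook[idx]                     # quantized frames, (T, D)
    # Straight-through estimator: gradients pass to the encoder as if
    # quantization were the identity map.
    q_st = z + (q - z).detach()
    # Commitment loss keeps encoder outputs close to their chosen codes.
    commit_loss = F.mse_loss(z, q.detach())
    return q_st, commit_loss

codebook = torch.randn(512, 64, requires_grad=True)  # assumed K=512, D=64
z = torch.randn(100, 64)                             # 100 frames of content features
q, commit_loss = vector_quantize(z, codebook)
```

In the paper's full objective, an estimate of the mutual information between the content, speaker and pitch representations is additionally minimized during training; the sketch above covers only the quantization step.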

Cited by 5 publications (9 citation statements)
References 32 publications (40 reference statements)
“…These approaches can also be found in the audio domain, such as speech augmentation [63], [90], [64], [69], GAN-based speech synthesis [33], [65], [59], [21], and VAE-based speech synthesis [62], [55], [117]. Specifically, VC [94], [75], [70], [104], [31] is a data synthesis approach that uses a source speaker's speech to generate additional voice samples that sound like a target speaker. Recent studies [107], [40] have revealed that it can be difficult for humans to distinguish whether the speech generated by a VC method is real or fake.…”
Section: A. One-shot Voice Conversion (mentioning)
confidence: 99%
“…Existing VC studies [60], [61] have shown that intra-gender VC (e.g., female to female) tends to perform better than inter-gender VC (e.g., female to male). Since a major difference between male and female voices is the pitch feature [70], [104], [75], which represents the fundamental frequency of an audio signal, our intuition is that selecting a source speaker whose voice has a pitch similar to the target speaker's may improve VC performance. Therefore, for an attacker who has a short speech sample of the target speaker and wants to generate more parrot speech samples, the first step in our design is to find the best source speaker in a speech dataset (which can be a public dataset or the attacker's own) such that the source speaker has the minimum average pitch distance to the target speaker.…”
Section: B. Parrot Speech Sample Generation and Performance (mentioning)
confidence: 99%
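The speaker-selection step quoted above reduces to a nearest-neighbour search over average pitch (F0). A minimal sketch, assuming librosa's pyin pitch tracker; `mean_f0`, `best_source_speaker`, and the `candidates` layout are hypothetical names for illustration:

```python
# Pick the candidate source speaker whose average F0 is closest to the
# target speaker's. Dataset layout and pyin parameters are assumptions.
import numpy as np
import librosa

def mean_f0(wav_path, sr=16000):
    """Average fundamental frequency over the voiced frames of one utterance."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C7'), sr=sr)
    return np.nanmean(f0[voiced]) if voiced.any() else np.nan

def best_source_speaker(target_wav, candidates):
    """candidates: dict mapping speaker id -> list of wav paths."""
    target_pitch = mean_f0(target_wav)
    # Average F0 over each candidate's utterances, then take the closest.
    avg = {spk: np.nanmean([mean_f0(p) for p in paths])
           for spk, paths in candidates.items()}
    return min(avg, key=lambda spk: abs(avg[spk] - target_pitch))
```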
“…Mainstream SVC models can be grouped into three categories: 1) parallel spectral feature mapping models, which learn the conversion function between source and target singers from parallel singing data (Villavicencio and Bonada, 2010; Kobayashi et al., 2015; Sisman et al., 2019); 2) Cycle-consistent Generative Adversarial Networks (CycleGAN) (Zhu et al., 2017; Kaneko et al., 2019), where an adversarial loss and a cycle-consistency loss are used jointly to learn the forward and inverse mappings (Sisman and Li, 2020); 3) encoder-decoder models, such as PPG-SVC (Li et al., 2021), which leverages a singing voice synthesis (SVS) system for SVC, and auto-encoder-based SVC (Qian et al., 2019a; Wang et al., 2021b; Yuan et al., 2020; Wang et al., 2021a). The models of the latter two categories can be used with non-parallel data.…”
Section: Singing Voice Conversion (mentioning)
confidence: 99%
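The cycle-consistency objective named in the CycleGAN category above can be illustrated with a toy sketch; `G_st` and `G_ts` are hypothetical stand-ins for the forward and inverse mapping networks, and the feature shapes are assumptions:

```python
# Cycle-consistency for non-parallel conversion: mapping source -> target
# and back should reconstruct the original features. Linear layers stand
# in for the real generator networks purely for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

G_st = nn.Linear(80, 80)   # source -> target mapping (toy stand-in)
G_ts = nn.Linear(80, 80)   # target -> source mapping (toy stand-in)

x_src = torch.randn(16, 80)        # batch of source spectral frames
cycle = G_ts(G_st(x_src))          # forward mapping, then inverse mapping
cycle_loss = F.l1_loss(cycle, x_src)
# In training, this term is combined with adversarial losses on the
# converted features, which is what removes the need for parallel data.
```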
“…Therefore, we set the model described in Section 4 as our baseline model, which directly uses raw speech to localize the video fragments without any audio pre-training; we also implement VSLNet [49] with audio ourselves. Besides, we compare our proposed method with another audio pre-training model, VQCPC [43], which combines VQ-VAE [42] and CPC to extract speech representations. We also use DeepSpeech to transcribe the noisy audio into text and test the performance of the baseline model with these ASR transcripts.…”
Section: Compared With Baseline Model (mentioning)
confidence: 99%
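The VQ-plus-CPC combination attributed to VQCPC above can be sketched roughly as follows; the shapes, codebook size, and single-step prediction horizon are assumptions for illustration, not the cited model's actual design:

```python
# Rough sketch of the VQ+CPC idea: quantize encoder outputs VQ-VAE style,
# then train with a contrastive predictive-coding (InfoNCE) loss that
# scores the true next quantized frame against in-sequence negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, K = 50, 64, 256                 # assumed frame count, dim, codebook size
codebook = torch.randn(K, D)
z = torch.randn(T, D)                 # encoder outputs for one utterance
q = codebook[torch.cdist(z, codebook).argmin(dim=1)]   # quantized frames

gru = nn.GRU(D, D, batch_first=True)  # autoregressive context network
c, _ = gru(q.unsqueeze(0))            # context vectors, (1, T, D)
c = c.squeeze(0)

# InfoNCE: each context c[t] should rank its true next frame q[t+1]
# above the other frames in the sequence (the negatives).
logits = c[:-1] @ q[1:].t()           # (T-1, T-1) similarity scores
labels = torch.arange(T - 1)          # diagonal holds the positives
cpc_loss = F.cross_entropy(logits, labels)
```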