2022
DOI: 10.48550/arxiv.2208.08757
Preprint

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

citations
Cited by 4 publications
(7 citation statements)
references
References 0 publications
“…MaskVC (Kaneko et al, 2021): The full name is MaskCycleGAN-VC, an extension of CycleGAN-VC2 with the addition of a masking mechanism. SRD (Yang et al, 2022): A voice conversion model that uses mutual information for speech feature disentanglement. Proposed method: Based on MaskCycleGAN-VC, it incorporates LFD and TFAAN.…”
Section: Evaluation Metrics (mentioning)
confidence: 99%
“…MaskCycleGAN-VC (Kaneko et al, 2021) introduced a mask mechanism to generate higher quality converted speech while keeping the model size manageable. SRD (Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion) (Yang et al, 2022) uses distinct encoders to capture various speech attributes, such as pitch, tone, timbre and rhythm, and uses mutual information to further disentangle different aspects of speech in a self-supervised manner.…”
Section: Introduction (mentioning)
confidence: 99%
“…Jozsef Nemeth et al [25] also proposed an adversarial decoupling method based on group observation to separate content and style-related attributes. Yang et al [47] used a gradient reversal layer (GRL) [11] based adversarial classifier to eliminate speaker information in latent space for voice conversion tasks, extracting features related to speaker identity using a common classifier for timbre. In our work, we adopt the adversarial paradigm to decouple video features into actional and spatial components, inspired by these prior works.…”
Section: Disentangled Representation Learning (mentioning)
confidence: 99%
“…Zero-shot VC methods usually follow auto-encoder frameworks, where the encoder extracts content and speaker representations from speech respectively, and the decoder reconstructs speech by combining the above representations. Hence, speech representation disentanglement is crucial for this task [41,49]. Recently, several zero-shot VC methods [41,49,52] based on information theory have emerged, with the aim of disentangling the content-related and speaker identity-related information.…”
Section: Related Work, 2.1 Voice Conversion (mentioning)
confidence: 99%
“…Hence, speech representation disentanglement is crucial for this task [41,49]. Recently, several zero-shot VC methods [41,49,52] based on information theory have emerged, with the aim of disentangling the content-related and speaker identity-related information. IDE-VC [52] employed mutual information (MI) with speaker labels as supervision for disentanglement.…”
Section: Related Work, 2.1 Voice Conversion (mentioning)
confidence: 99%
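
Several of the statements above describe mutual information (MI) as the quantity minimized between content and speaker representations. As a rough illustration of what that quantity measures, here is a minimal sketch of a plug-in MI estimate for paired discrete samples. It is not the method of any cited paper: the cited approaches work on continuous learned representations and would need a different estimator. The function name `mutual_information` and the toy data are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: quantized "content" codes paired with speaker labels.
speakers = ["A", "A", "B", "B"]
leaky_codes = [0, 0, 1, 1]  # the code fully determines the speaker
clean_codes = [0, 1, 0, 1]  # the code is independent of the speaker

print(mutual_information(leaky_codes, speakers))  # log(2), about 0.693
print(mutual_information(clean_codes, speakers))  # 0.0
```

In this toy setting, a disentangled content code corresponds to the second case: knowing the code tells you nothing about the speaker, so the estimate is zero.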