Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

Yang, SiCheng; Tantrawenith, Methawee; Zhuang, Haolin; Wu, Zhiyong; Sun, Aolan; Wang, Jianzong; Cheng, Ning; Tang, Huaizhen; Zhao, Xintao; Wang, Jie; Meng, Helen

doi:10.48550/arxiv.2208.08757

Cited by 4 publications

(7 citation statements)

References 0 publications

“…MaskVC (Kaneko et al, 2021): The full name is MaskCycleGAN-VC, an extension of CycleGAN-VC2 with the addition of a masking mechanism. SRD (Yang et al, 2022): A voice conversion model that uses mutual information for speech feature disentanglement. Proposed method: Based on MaskCycleGAN-VC, it incorporates LFD and TFAAN.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%

“…MaskCycleGAN-VC (Kaneko et al, 2021) introduced a mask mechanism to generate higher quality converted speech while keeping the model size manageable. SRD (Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion) (Yang et al, 2022) uses distinct encoders to capture various speech attributes, such as pitch, tone, timbre and rhythm, and uses mutual information to further disentangle different aspects of speech in a self-supervised manner.…”

Section: Introduction (mentioning)

confidence: 99%
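
The statement above says SRD uses mutual information to disentangle speech attributes. As a rough illustration of the quantity itself, the sketch below computes the exact mutual information of two discrete variables from a joint probability table. This is an assumption-laden toy: the function name `mutual_information` and the example tables are hypothetical, and neural disentanglement methods typically rely on learned estimators or bounds rather than an exact table, so this is not the cited paper's procedure.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in nats for a joint probability table joint[i][j] = p(x_i, y_j)."""
    # Marginals: sum over rows for p(x), over columns for p(y).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


# Hypothetical example tables: two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing X says nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X fully determines Y

print(mutual_information(independent))
print(mutual_information(dependent))
```

A disentanglement objective of this kind pushes the mutual information between two representations (for example, content and timbre) toward the independent case, where the value is zero.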

Cross-lingual speaker transfer for Cambodian based on feature disentangler and time-frequency attention adaptive normalization

Yang, Wang, Gao et al. 2024. IJWIS

Purpose: This paper aims to disentangle Chinese-English-rich resources linguistic and speaker timbre features, achieving cross-lingual speaker transfer for Cambodian.

Design/methodology/approach: This study introduces a novel approach: the construction of a cross-lingual feature disentangler coupled with the integration of time-frequency attention adaptive normalization to proficiently convert Cambodian speaker timbre into Chinese-English without altering the underlying Cambodian speech content.

Findings: Considering the limited availability of multi-speaker corpora in Cambodia, conventional methods have demonstrated subpar performance in Cambodian speaker voice transfer.

Originality/value: The originality of this study lies in the effectiveness of the disentanglement process and precise control over speaker timbre feature transfer.

“…Jozsef Nemeth et al [25] also proposed an adversarial decoupling method based on group observation to separate content and style-related attributes. Yang et al [47] used a gradient reversal layer (GRL) [11] based adversarial classifier to eliminate speaker information in latent space for voice conversion tasks, extracting features related to speaker identity using a common classifier for timbre. In our work, we adopt the adversarial paradigm to decouple video features into actional and spatial components, inspired by these prior works.…”

Section: Disentangled Representation Learning (mentioning)

confidence: 99%
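
The statement above mentions a gradient reversal layer (GRL) placed in front of an adversarial speaker classifier. A minimal sketch of the idea, in plain Python with hand-written forward and backward passes, is given below. The class name `GradientReversal` and the `scale` parameter are hypothetical; a real model would implement this as a custom autograd function in its deep learning framework, and this is not claimed to be the cited implementation.

```python
class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    def __init__(self, scale=1.0):
        # scale is a hypothetical weight on the reversed gradient.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the adversarial classifier.
        return list(features)

    def backward(self, grad_output):
        # The classifier's gradient is flipped before it reaches the encoder,
        # so the encoder is updated to make the classifier's task harder.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
latent = [1.0, -2.0]
classifier_grad = [0.2, -0.4]  # hypothetical gradient from a speaker classifier

print(grl.forward(latent))
print(grl.backward(classifier_grad))
```

Because the forward pass is the identity, the classifier still trains normally on the latent features, while the encoder receives the opposite gradient and is pushed to remove speaker information from the latent space.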

Taking A Closer Look at Visual Relation: Unbiased Video Scene Graph Generation with Decoupled Label Learning

Wang, Luo, Chen et al. 2023. Preprint

Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented due to the inherent biased distribution in the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.

“…Zero-shot VC methods usually follow auto-encoder frameworks, where the encoder extracts content and speaker representations from speech respectively, and the decoder reconstructs speech by combining the above representations. Hence, speech representation disentanglement is crucial for this task [41,49]. Recently, several zero-shot VC methods [41,49,52] based on information theory have emerged, with the aim of disentangling the content-related and speaker identity-related information.…”

Section: Related Work, 2.1 Voice Conversion (mentioning)

confidence: 99%

“…Hence, speech representation disentanglement is crucial for this task [41,49]. Recently, several zero-shot VC methods [41,49,52] based on information theory have emerged, with the aim of disentangling the content-related and speaker identity-related information. IDE-VC [52] employed mutual information (MI) with speaker labels as supervision for disentanglement.…”

Section: Related Work, 2.1 Voice Conversion (mentioning)

confidence: 99%

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Sheng, Ai, Chen et al. 2023. Proceedings of the 31st ACM International Conference on Multimedia

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website.

CCS Concepts: Information systems → Multimedia content creation.
