2021
DOI: 10.48550/arxiv.2102.00184
Preprint

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

Abstract: Factorizing speech into disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for additional speech factors use primary disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which, however, are hard …
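To make the core idea concrete, below is a minimal, hypothetical sketch of adversarial disentanglement in PyTorch. It is not the paper's architecture: all module names and sizes (ContentEncoder, SpeakerAdversary, an 80-mel input, a 64-dim bottleneck) are illustrative assumptions. A speaker classifier is trained on the content embedding through a gradient-reversal layer, so the encoder learns to strip speaker information out of the content factor.

# Minimal sketch of adversarial disentanglement for speech representations.
# All module names and sizes are illustrative assumptions, not the paper's
# actual architecture.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ContentEncoder(nn.Module):
    """Toy content encoder: mel-spectrogram frames -> bottleneck embeddings."""

    def __init__(self, n_mels=80, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, bottleneck)
        )

    def forward(self, mels):          # mels: (batch, frames, n_mels)
        return self.net(mels)         # (batch, frames, bottleneck)


class SpeakerAdversary(nn.Module):
    """Adversarial classifier that tries to predict speaker ID from content."""

    def __init__(self, bottleneck=64, n_speakers=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, n_speakers)
        )

    def forward(self, content, lambd=1.0):
        # Reverse gradients so the encoder is trained to *remove* speaker cues.
        reversed_content = GradReverse.apply(content, lambd)
        pooled = reversed_content.mean(dim=1)   # pool over frames
        return self.net(pooled)                 # (batch, n_speakers) logits


if __name__ == "__main__":
    encoder, adversary = ContentEncoder(), SpeakerAdversary()
    mels = torch.randn(8, 120, 80)              # fake batch: 8 utterances
    speaker_ids = torch.randint(0, 100, (8,))
    logits = adversary(encoder(mels))
    # Minimizing this loss updates the adversary normally but, through the
    # reversed gradients, updates the encoder to fool the adversary.
    loss = nn.functional.cross_entropy(logits, speaker_ids)
    loss.backward()
    print("adversarial loss:", loss.item())

The same pattern generalizes to other factors: one adversary per nuisance factor, each pushing its factor out of the embedding where it does not belong.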

Cited by 3 publications (1 citation statement)
References 43 publications
“…[16] proposes a method for few-shot speaker adaptation and generation of an unseen speaker's style by incorporating a non-autoregressive feed-forward Transformer along with adaptive normalization. Adversarial learning was employed in [14] to avoid source speaker leakage in prosody transfer tasks, and in [32] to ensure prosodic disentanglement in voice conversion. Also, in [36], a multispeaker Transformer-based model with an ASR module and an utterance-level prosody encoder is fine-tuned to the target speaker for prosody transfer.…”
Section: Related Work
confidence: 99%