2021
DOI: 10.48550/arxiv.2102.00184
Preprint

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

Abstract: Factorizing speech into disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for additional speech factors use primary disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which, however, are hard …
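To make the core idea concrete, below is a minimal, hypothetical sketch of adversarial disentanglement in PyTorch. It is not the paper's architecture: all module names and sizes (ContentEncoder, SpeakerAdversary, an 80-mel input, a 64-dim bottleneck) are illustrative assumptions. A speaker classifier is trained on the content embedding through a gradient-reversal layer, so the encoder learns to strip speaker information out of the content factor.

# Minimal sketch of adversarial disentanglement for speech representations.
# All module names and sizes are illustrative assumptions, not the paper's
# actual architecture.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ContentEncoder(nn.Module):
    """Toy content encoder: mel-spectrogram frames -> bottleneck embeddings."""

    def __init__(self, n_mels=80, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, bottleneck)
        )

    def forward(self, mels):          # mels: (batch, frames, n_mels)
        return self.net(mels)         # (batch, frames, bottleneck)


class SpeakerAdversary(nn.Module):
    """Adversarial classifier that tries to predict speaker ID from content."""

    def __init__(self, bottleneck=64, n_speakers=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, n_speakers)
        )

    def forward(self, content, lambd=1.0):
        # Reverse gradients so the encoder is trained to *remove* speaker cues.
        reversed_content = GradReverse.apply(content, lambd)
        pooled = reversed_content.mean(dim=1)   # pool over frames
        return self.net(pooled)                 # (batch, n_speakers) logits


if __name__ == "__main__":
    encoder, adversary = ContentEncoder(), SpeakerAdversary()
    mels = torch.randn(8, 120, 80)              # fake batch: 8 utterances
    speaker_ids = torch.randint(0, 100, (8,))
    logits = adversary(encoder(mels))
    # Minimizing this loss updates the adversary normally but, through the
    # reversed gradients, updates the encoder to fool the adversary.
    loss = nn.functional.cross_entropy(logits, speaker_ids)
    loss.backward()
    print("adversarial loss:", loss.item())

The same pattern generalizes to other factors: one adversary per nuisance factor, each pushing its factor out of the embedding where it does not belong.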

Cited by 3 publications (1 citation statement)
References 43 publications
“…[16] proposes a method for few-shot speaker adaptation and generation of an unseen speaker's style by incorporating a non-autoregressive feed-forward Transformer along with adaptive normalization. Adversarial learning was employed in [14] to avoid source speaker leakage in prosody transfer tasks, and in [32] to ensure prosodic disentanglement in voice conversion. Also, in [36], a multispeaker Transformer-based model with an ASR module and an utterance-level prosody encoder is fine-tuned to the target speaker for prosody transfer.…”
Section: Related Work
confidence: 99%