Cross-lingual multi-speaker speech synthesis with limited bilingual training data

Cai, Zexin; Yang, Yaogen; Li, Ming

doi:10.1016/j.csl.2022.101427

Cited by 7 publications

(3 citation statements)

References 26 publications

“…However, this approach is based on a multi-speaker TTS system as opposed to our single-speaker model. Similarly, data augmentation using a voice conversion module was explored in (Cai et al, 2023) and (Ribeiro et al, 2022).…”

Section: Related Work (mentioning)

confidence: 99%

Code-Mixed Text-to-Speech Synthesis Under Low-Resource Constraints

Joshi, Garera. 2023. Lecture Notes in Computer Science.

Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build a production-quality TTS system and perform speaker adaptation in extremely low-resource settings. We propose a transfer learning approach using high-resource language data and synthetically generated data. We transfer the learnings from the out-of-domain, high-resource English language. Further, we make use of an out-of-the-box single-speaker TTS system in the target language to generate in-domain synthetic data. We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language, Hindi. We use a Tacotron2-like setup with a spectrogram prediction network and a WaveGlow vocoder. The Tacotron2 acoustic model is trained on English data, followed by synthetic Hindi data from the existing TTS system. Finally, the decoder of this model is fine-tuned on only 3 hours of target Hindi speaker data to enable rapid speaker adaptation. We show the importance of this dual pre-training and decoder-only fine-tuning using subjective MOS evaluation. Using transfer learning from a high-resource language and a synthetic corpus, we present a low-cost solution to train a custom TTS model.

“…Cross-lingual text-to-speech (TTS) [1], [2], [3] refers to the task that requires the system to generate speech in a language foreign to a target speaker. This task has many applications, such as code-mixed speech synthesis for a voice agent, foreign movie dubbing [4], and computer-assisted pronunciation teaching [5].…”

Section: Introduction (mentioning)

confidence: 99%

DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech — A Study Between English and Mandarin

Li, Hu, Cong et al. 2023. IEEE/ACM Trans. Audio Speech Lang. Process.

While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.

“…Voice conversion methods using data augmentation generate parallel data with acoustic features, such as duration, prosody, and energy, similar to the original voice, and then perform parallel voice conversion. Voice conversion methods that utilize nonparallel data based on deep learning have also been studied [18][19][20][21][22]. Other voice conversion methods include text-based approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Perturbation AUTOVC: Voice Conversion From Perturbation and Autoencoder Loss

Park, Lee, Chun. 2023. IEEE Access.

AUTOVC is a voice-conversion method that performs self-reconstruction using an autoencoder structure for zero-shot voice conversion. AUTOVC has the advantage of being easy and simple to learn because it only uses the autoencoder loss for learning. However, it performs voice conversion by disentangling speech information from speakers and linguistic information by adjusting the bottleneck dimension; this requires highly meticulous fine-tuning of the bottleneck dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY), a fully self-supervised learning system that uses perturbations to extract speech features, has been proposed. NANSY solves the problem of the adjustment of the bottleneck dimension by utilizing perturbation and exhibits high reconstruction performance. In this study, we propose perturbation AUTOVC, a voice conversion method that utilizes the structure of AUTOVC and the perturbation of NANSY. The proposed method applies perturbations to speech signals (such as NANSY signals) to solve the problem of the voice conversion method using bottleneck dimensions. Perturbation is applied to remove the speaker-dependent information present in the speech, leaving only the linguistic information, which is then passed through a content encoder and modeled as a content embedding containing only the linguistic information. To obtain speaker information, we used x-vectors, which are extensively used in pretrained speaker recognition. The concatenated linguistic and speaker information extracted from the encoder and additional energy information is used as input to the decoder to perform self-reconstruction. Similar to AUTOVC, it is easy and simple to learn using only the autoencoder loss. For the evaluation, we measured three objective evaluation metrics: character error rate (%), cosine similarity, and short-time objective intelligibility, as well as a subjective evaluation metric: mean opinion score. The experimental results demonstrate that our proposed method outperforms other voice conversion techniques and demonstrates robust performance in zero-shot conversion.
