Interspeech 2020
DOI: 10.21437/interspeech.2020-1032

From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

Abstract: In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a system involving feedback constraints for multispeaker speech synthesis. We enhance the knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network. The constraint is taken by an added …
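The truncated abstract describes the paper's core idea: a speaker verification network is kept in the loop during TTS training so that the synthesized speech is pushed toward the target speaker's embedding. As a rough illustration only (the exact formulation is behind the truncation), here is a minimal PyTorch-style sketch; all names, including speaker_encoder, are hypothetical:

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(mel_pred, mel_ref, speaker_encoder):
    # Embed the reference utterance with the frozen speaker verification
    # network; no gradient is needed on the target side.
    with torch.no_grad():
        emb_ref = speaker_encoder(mel_ref)
    # Embed the synthesized mel with gradients enabled, so the speaker
    # loss back-propagates into the TTS model being trained.
    emb_pred = speaker_encoder(mel_pred)
    emb_ref = F.normalize(emb_ref, dim=-1)
    emb_pred = F.normalize(emb_pred, dim=-1)
    # Cosine distance between the two speaker embeddings.
    return 1.0 - (emb_ref * emb_pred).sum(dim=-1).mean()
```

In such a setup the feedback term would typically be added to the usual reconstruction loss with a weighting factor, e.g. total_loss = recon_loss + lambda_fb * feedback_constraint_loss(...).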

Cited by 29 publications (21 citation statements) | References 18 publications
“…As the classifier extracts high-level features that are relevant for phone recognition, this loss term supervises the training of WaveNet to attend to temporal dynamics and to penalize bad pronunciations. Cai et al. [77] use a pre-trained speaker embedding network to provide a feedback constraint, which serves as the perceptual loss for the training of a multispeaker TTS system.…”
Section: Perceptual Loss for Style Reconstruction
confidence: 99%
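The perceptual loss this excerpt describes compares high-level features from a frozen, pre-trained classifier instead of raw outputs. A minimal sketch under that assumption (feature_extractor is a hypothetical name for the frozen classifier's feature head):

```python
import torch
import torch.nn.functional as F

def perceptual_loss(wav_pred, wav_ref, feature_extractor):
    # High-level features of the ground-truth waveform, frozen.
    with torch.no_grad():
        feat_ref = feature_extractor(wav_ref)
    # Features of the synthesized waveform, with gradients flowing
    # back into the model being trained.
    feat_pred = feature_extractor(wav_pred)
    # Matching classifier features penalizes poor temporal dynamics and
    # mispronunciations more directly than a sample-level loss would.
    return F.l1_loss(feat_pred, feat_ref)
```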
“…The zero-shot speaker adaptation approach has been used by recent studies to synthesize speech from unseen target speakers without retraining the model for those speakers [20,21,22,23]. It utilizes a speaker encoder that has been trained separately for speaker recognition tasks such as automatic speaker identification (ASI) and automatic speaker verification (ASV).…”
Section: Introduction
confidence: 99%
“…It is widely used to model speaker representation. This includes zero-shot speaker adaptation in Tacotron-based TTS that uses a DNN-based x-vector embedding [20,22], d-vector embedding [21], and ResNet34 speaker embedding [23].…”
Section: Introduction
confidence: 99%
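The zero-shot recipe these two excerpts describe reduces to one inference pass through the separately trained verification network, followed by conditioning the synthesizer on the resulting embedding. A hedged sketch of both steps (verification_net, enc_out, and the concatenation point are assumptions, not the cited papers' exact designs):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_speaker_embedding(ref_mel, verification_net):
    # One forward pass through the pre-trained verification network
    # yields a fixed-dimensional speaker embedding (d-vector/x-vector
    # style); the TTS model is never retrained for the new speaker.
    emb = verification_net(ref_mel)            # (batch, emb_dim)
    return F.normalize(emb, dim=-1)

def condition_encoder_outputs(enc_out, spk_emb):
    # Broadcast the utterance-level embedding over the text time axis
    # and concatenate it to the Tacotron-style encoder outputs.
    spk = spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
    return torch.cat([enc_out, spk], dim=-1)
```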
“…Previous research on multi-speaker TTS typically uses a speaker representation to control the synthesized utterance's speaker identity. This speaker representation can be jointly learned with the TTS model in the form of an embedding table [4,5,6,7] or a speaker encoder [5,6,8], or can be transferred from another pre-trained model for speaker information extraction [9,10]. To control the style of synthesized speech, global style token (GST)…”
Section: Introduction
confidence: 99%
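This last excerpt contrasts two sources for the speaker representation: a lookup table learned jointly with the TTS model, or an embedding transferred from a pre-trained external model. A small illustrative sketch of that choice (class and argument names are hypothetical):

```python
import torch
import torch.nn as nn

class SpeakerRepresentation(nn.Module):
    # (a) a jointly learned lookup table over training-speaker IDs, or
    # (b) an embedding transferred from a pre-trained external model.
    def __init__(self, num_speakers, emb_dim):
        super().__init__()
        self.table = nn.Embedding(num_speakers, emb_dim)  # option (a)

    def forward(self, speaker_id=None, transferred_emb=None):
        if transferred_emb is not None:    # option (b): pre-trained model
            return transferred_emb
        return self.table(speaker_id)      # option (a): embedding table
```

Option (a) only covers speakers seen in training, which is why the zero-shot approaches above transfer the embedding from a verification model instead.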