ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413880
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

Abstract: The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations…
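
As a rough illustration of the paper's central idea, the following PyTorch sketch fuses a frozen, pretrained speaker embedding (e.g., one extracted by a voice conversion model) with a learnable per-speaker lookup table before conditioning a FastSpeech 2-style synthesizer. All module names, dimensions, and the fusion scheme below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class CombinedSpeakerEmbedding(nn.Module):
    """Fuse a frozen pretrained speaker embedding with a learnable
    per-speaker lookup, then project to the TTS hidden size.
    Hypothetical sketch; names and sizes are not from the paper."""

    def __init__(self, num_speakers, pretrained_dim=256,
                 learnable_dim=128, hidden_dim=256):
        super().__init__()
        self.lookup = nn.Embedding(num_speakers, learnable_dim)
        self.proj = nn.Linear(pretrained_dim + learnable_dim, hidden_dim)

    def forward(self, pretrained_emb, speaker_ids):
        # pretrained_emb: (batch, pretrained_dim), extracted offline
        # speaker_ids:    (batch,) integer speaker indices
        learnable = self.lookup(speaker_ids)                   # (batch, learnable_dim)
        combined = torch.cat([pretrained_emb, learnable], -1)  # (batch, both dims)
        return self.proj(combined)                             # (batch, hidden_dim)

# The resulting vector would typically be broadcast-added to the
# FastSpeech 2 encoder outputs to condition synthesis on the speaker:
#   encoder_out = encoder_out + spk_emb.unsqueeze(1)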

Cited by 36 publications (27 citation statements) | References 10 publications

“…This is helpful in multi-speaker synthesis, as its goal is different from that of the speaker verification task. Previous studies [24] suggest that a continuous distribution of speaker embeddings performs better in the multi-speaker TTS task. Our experimental results in the similarity tests confirm these studies [24].…”
Section: Discussion (mentioning)
Confidence: 93%
“…Previous studies [24] suggest that a continuous distribution of speaker embeddings performs better in the multi-speaker TTS task. Our experimental results in the similarity tests confirm these studies [24]. As a result, using ECAPA-TDNN as the speaker encoder achieves better speech naturalness and speaker similarity.…”
Section: Discussion (mentioning)
Confidence: 93%
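
For readers who want to try the continuous-embedding setup described above, the sketch below extracts an utterance-level ECAPA-TDNN speaker embedding using the publicly released SpeechBrain checkpoint. Using SpeechBrain and this particular checkpoint is an assumption for illustration, not necessarily the citing study's pipeline.

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Publicly released ECAPA-TDNN speaker encoder trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

waveform, sample_rate = torchaudio.load("reference.wav")  # hypothetical file
if sample_rate != 16000:  # the checkpoint expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

with torch.no_grad():
    # Continuous utterance-level embedding of shape (1, 1, 192).
    embedding = encoder.encode_batch(waveform.mean(0, keepdim=True))

spk_emb = embedding.squeeze()  # a (192,)-dimensional conditioning vector
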
“…MTL has been widely used in computer vision, and a recent work [25] implemented an MTL model that handles 12 different datasets while achieving the state of the art on 11 of them. MTL has also been explored in automatic speech recognition (ASR) [26, 27], text-to-speech (TTS) [28], and speech emotion recognition (SER) [29, 30]. Cai et al. [30] recently presented state-of-the-art results for the SER task on the IEMOCAP dataset using a model based on an MTL framework.…”
Section: Multi-Task Learning: Related Work (mentioning)
Confidence: 99%
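
To make the hard-parameter-sharing pattern behind MTL concrete, here is a generic PyTorch sketch with one shared trunk, two task-specific heads, and a weighted sum of per-task losses. It is a minimal illustration of the technique and does not reproduce any of the cited models; all sizes and task choices are assumptions.

import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Generic hard-parameter-sharing MTL model: one shared trunk,
    one lightweight head per task. Purely illustrative."""

    def __init__(self, in_dim=80, hidden=256, num_emotions=4, num_speakers=100):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(hidden, num_emotions)  # main task
        self.speaker_head = nn.Linear(hidden, num_speakers)  # auxiliary task

    def forward(self, x):
        h = self.trunk(x)
        return self.emotion_head(h), self.speaker_head(h)

def mtl_loss(emo_logits, spk_logits, emo_y, spk_y, w=0.5):
    # Weighted sum of per-task losses; w trades off the auxiliary task.
    ce = nn.functional.cross_entropy
    return ce(emo_logits, emo_y) + w * ce(spk_logits, spk_y)
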
“…We use the Montreal Forced Aligner (MFA) [24] to extract the forced alignment for a given audio-text pair. Consistent with [5], the forced alignment is a sequence of monophones, of which there are 72 in total. In the next phase, the content prior encoder $E_{cp}$ takes the one-hot form of the alignment sequence as input at each time step and predicts the frame-wise content prior distribution $p(z_c \mid A^{FA}_X)$.…”
Section: Acoustic Alignment as Content Condition (mentioning)
Confidence: 99%
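
A minimal sketch of this step is given below: it one-hot encodes a frame-level monophone alignment (72 classes, as stated in the excerpt) and predicts the mean and log-variance of a frame-wise Gaussian prior. The GRU encoder and the diagonal-Gaussian parameterization are assumptions for illustration, not details taken from the cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MONOPHONES = 72  # from the excerpt above

class ContentPriorEncoder(nn.Module):
    """Sketch of a content prior encoder E_cp: consumes a one-hot
    frame-level monophone alignment and predicts the parameters of a
    frame-wise Gaussian prior p(z_c | A_X^FA). Layer choices here are
    assumptions, not taken from the cited paper."""

    def __init__(self, latent_dim=16, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(NUM_MONOPHONES, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, alignment_ids):
        # alignment_ids: (batch, frames) integer monophone IDs from MFA
        one_hot = F.one_hot(alignment_ids, NUM_MONOPHONES).float()
        h, _ = self.rnn(one_hot)           # (batch, frames, hidden)
        return self.mu(h), self.logvar(h)  # frame-wise prior parameters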