Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Doddipatla, Rama; Braunschweiler, Norbert; Maia, Ranniery

doi:10.21437/interspeech.2017-1038

Cited by 37 publications

(31 citation statements)

References 22 publications


“…Our future work includes comparing our method with other adaptation methods such as LHUC and SVD bottleneck speaker adaptation with low-rank approximation. Another interesting experiment we would like to see is the use of i-vector or d-vector [24] as a scaling code.…”

Section: Discussion (mentioning)

confidence: 99%

Scaling and Bias Codes for Modeling Speaker-Adaptive DNN-Based Speech Synthesis Systems

Luong

Yamagishi

2018

2018 IEEE Spoken Language Technology Workshop (SLT)


Most neural-network based speaker-adaptive acoustic models for speech synthesis can be categorized into either layer-based or input-code approaches. Although both approaches have their own pros and cons, most existing works on speaker adaptation focus on improving one or the other. In this paper, after we first systematically overview the common principles of neural-network based speaker-adaptive models, we show that these approaches can be represented in a unified framework and can be generalized further. More specifically, we introduce the use of scaling and bias codes as generalized means for speaker-adaptive transformation. By utilizing these codes, we can create a more efficient factorized speaker-adaptive model and capture advantages of both approaches while reducing their disadvantages. The experiments show that the proposed method can improve the performance of speaker adaptation compared with speaker adaptation based on the conventional input code.
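The abstract above does not give the layer equation, so the following is only a minimal sketch of one plausible reading: a speaker's scaling code is projected to a per-unit multiplier and the bias code to a per-unit offset. The function name `adapted_layer`, the projection matrices `A` and `B`, and the `tanh` activation are all assumptions made for illustration, not the authors' formulation.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapted_layer(x, W, c, A, scaling_code, B, bias_code):
    """Hypothetical speaker-adaptive hidden layer (illustrative only).

    Computes tanh(scale * (W x) + c + shift), where
      scale = A @ scaling_code   (per-unit multiplier from the scaling code)
      shift = B @ bias_code      (per-unit offset from the bias code)
    """
    pre = matvec(W, x)                 # speaker-independent pre-activation
    scale = matvec(A, scaling_code)    # assumed projection of the scaling code
    shift = matvec(B, bias_code)       # assumed projection of the bias code
    return [math.tanh(s * p + ci + bi)
            for s, p, ci, bi in zip(scale, pre, c, shift)]
```

Under this reading, fixing the scale to one leaves only the bias path, which resembles a conventional input code; a non-trivial scale adds the kind of per-unit modulation associated with layer-based methods.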


“…As mentioned earlier, several techniques for speaker adaptation using i-vectors [5] or d-vectors [15] have been developed. As for the former, i-vectors are directly used as inputs for DNN-based speech synthesis.…”

Section: Advantage Of Proposed Framework (mentioning)

confidence: 99%

“…An unsupervised speaker-adaptation technique using a bottle-neck layer of a DNN-based speaker-recognition model for DNN-based speech synthesis was proposed by Doddipatla et al [15]. As for this technique, PCA is applied to the bottle-neck features of the DNN-based speaker recognition, and the first eigenvector is interpolated on the basis of the posterior probabilities of the speaker-recognition model.…”

Section: Advantage Of Proposed Framework (mentioning)

confidence: 99%

Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes

Takaki

Nishimura

Yamagishi

2018

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)


A new speaker-adaptation technique for deep neural network (DNN)-based speech synthesis, which requires only speech data without orthographic transcriptions, is proposed. This technique is based on a DNN-based speech-synthesis model that takes speaker, gender, and age into consideration as additional inputs and outputs acoustic parameters of corresponding voices from text in order to construct a multi-speaker model and perform speaker adaptation. It uses a new input code that represents acoustic similarity to each of the training speakers as a probability. The new input code, called "speaker-similarity vector," is obtained by concatenating posterior probabilities calculated from each model of the training speakers. GMM-UBM or i-vector/PLDA, which are widely used in text-independent speaker verification, are used to represent the speaker models, since they can be used without text information. Text and the speaker-similarity vectors of the training speakers are used as input to first train a multi-speaker speech-synthesis model, which outputs acoustic parameters of the training speakers. A new speaker-similarity vector is then estimated by using a small amount of speech data uttered by an unknown target speaker on the basis of the separately trained speaker models. It is expected that inputting the estimated speaker-similarity vector into the multi-speaker speech-synthesis model can generate synthetic speech that resembles the target speaker's voice. In objective and subjective experiments, adaptation performance of the proposed technique was evaluated using not only studio-quality adaptation data but also low-quality (i.e., noisy and reverberant) data. The results of the experiments indicate that the proposed technique makes it possible to rapidly construct a voice for the target speaker in DNN-based speech synthesis.
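As a rough illustration of the speaker-similarity vector described above, the sketch below turns per-speaker scores into one posterior per training speaker. It assumes that each training-speaker model yields a log-likelihood for the adaptation data and that all speakers have equal priors; the abstract does not state how the posteriors are computed, so the function name `speaker_similarity_vector` and the softmax normalization are illustrative choices, not the authors' procedure.

```python
import math

def speaker_similarity_vector(log_likelihoods):
    """Convert per-speaker log-likelihoods into posterior probabilities.

    Assumes equal priors over training speakers, so the posterior is a
    softmax over the scores. Subtracting the maximum keeps exp() stable.
    """
    m = max(log_likelihoods)
    exps = [math.exp(ll - m) for ll in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one target utterance under three training-speaker models.
code = speaker_similarity_vector([-120.0, -118.5, -125.0])
```

The resulting vector has one entry per training speaker and sums to one, so it can be fed to a multi-speaker synthesis model in the same slot as a one-hot speaker code.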


“…An effective way to solve this problem is to use a technique like speaker adaptation [16]-[20], in which a baseline model is trained using a large database, then adjusted to a target speaker using only a small amount of data. This approach can similarly be applied to expressiveness tasks through emotion transplantation, i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Effective Emotion Transplantation in an End-to-End Text-to-Speech System

et al. 2020


In this paper, we propose an effective technique to transplant a source speaker's emotional expression to a new target speaker's voice within an end-to-end text-to-speech (TTS) framework. We modify an expressive TTS model pre-trained using a source speaker's emotional speech database to reflect the voice characteristics of a target speaker for which only a neutral speech database is available. We set two adaptation criteria to achieve this. One criterion is to minimize the reconstruction loss between the target speaker's recorded and synthesized speech, such that the synthesized speech has the target speaker's voice characteristics. The other criterion is to minimize the emotion loss between the emotion embedding vectors extracted from the reference expressive speech and the target speaker's synthesized expressive speech, which is essential to preserve expressiveness. Since the two criteria are applied alternately in the adaptation process, we are able to avoid the kind of bias issues frequently encountered in similar tasks. The proposed adaptation technique demonstrates more effective performance compared to conventional approaches in both quantitative and qualitative evaluations.

