ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413880
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

Abstract: The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations…
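
As a rough illustration of the paper's central idea, the following PyTorch sketch fuses a frozen, pretrained speaker embedding (e.g., one extracted by a voice conversion model) with a learnable per-speaker lookup table before conditioning a FastSpeech 2-style synthesizer. All module names, dimensions, and the fusion scheme below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class CombinedSpeakerEmbedding(nn.Module):
    """Fuse a frozen pretrained speaker embedding with a learnable
    per-speaker lookup, then project to the TTS hidden size.
    Hypothetical sketch; names and sizes are not from the paper."""

    def __init__(self, num_speakers, pretrained_dim=256,
                 learnable_dim=128, hidden_dim=256):
        super().__init__()
        self.lookup = nn.Embedding(num_speakers, learnable_dim)
        self.proj = nn.Linear(pretrained_dim + learnable_dim, hidden_dim)

    def forward(self, pretrained_emb, speaker_ids):
        # pretrained_emb: (batch, pretrained_dim), extracted offline
        # speaker_ids:    (batch,) integer speaker indices
        learnable = self.lookup(speaker_ids)                   # (batch, learnable_dim)
        combined = torch.cat([pretrained_emb, learnable], -1)  # (batch, both dims)
        return self.proj(combined)                             # (batch, hidden_dim)

# The resulting vector would typically be broadcast-added to the
# FastSpeech 2 encoder outputs to condition synthesis on the speaker:
#   encoder_out = encoder_out + spk_emb.unsqueeze(1)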

Cited by 36 publications (27 citation statements) | References 10 publications

“…This is helpful in multi-speaker synthesis, as its goal is different from that of the speaker verification task. Previous studies [24] suggest that a continuous distribution of speaker embeddings performs better in the multi-speaker TTS task. Our experimental results in the similarity tests confirm these studies [24].…”
Section: Discussion (mentioning)
Confidence: 93%
“…Previous studies [24] suggest that a continuous distribution of speaker embeddings performs better in the multi-speaker TTS task. Our experimental results in the similarity tests confirm these studies [24]. As a result, using ECAPA-TDNN as the speaker encoder achieves better speech naturalness and speaker similarity.…”
Section: Discussion (mentioning)
Confidence: 93%
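
For readers who want to try the continuous-embedding setup described above, the sketch below extracts an utterance-level ECAPA-TDNN speaker embedding using the publicly released SpeechBrain checkpoint. Using SpeechBrain and this particular checkpoint is an assumption for illustration, not necessarily the citing study's pipeline.

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Publicly released ECAPA-TDNN speaker encoder trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

waveform, sample_rate = torchaudio.load("reference.wav")  # hypothetical file
if sample_rate != 16000:  # the checkpoint expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

with torch.no_grad():
    # Continuous utterance-level embedding of shape (1, 1, 192).
    embedding = encoder.encode_batch(waveform.mean(0, keepdim=True))

spk_emb = embedding.squeeze()  # a (192,)-dimensional conditioning vector
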
“…MTL has been widely used in computer vision, and a recent work [25] implemented an MTL model that handles 12 different datasets while achieving the state of the art on 11 of them. MTL has also been explored in automatic speech recognition (ASR) [26, 27], text-to-speech (TTS) [28], and speech emotion recognition (SER) [29, 30]. Cai et al. [30] recently presented state-of-the-art results for the SER task on the IEMOCAP dataset using a model based on an MTL framework.…”
Section: Multi-Task Learning: Related Work (mentioning)
Confidence: 99%
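
To make the hard-parameter-sharing pattern behind MTL concrete, here is a generic PyTorch sketch with one shared trunk, two task-specific heads, and a weighted sum of per-task losses. It is a minimal illustration of the technique and does not reproduce any of the cited models; all sizes and task choices are assumptions.

import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Generic hard-parameter-sharing MTL model: one shared trunk,
    one lightweight head per task. Purely illustrative."""

    def __init__(self, in_dim=80, hidden=256, num_emotions=4, num_speakers=100):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(hidden, num_emotions)  # main task
        self.speaker_head = nn.Linear(hidden, num_speakers)  # auxiliary task

    def forward(self, x):
        h = self.trunk(x)
        return self.emotion_head(h), self.speaker_head(h)

def mtl_loss(emo_logits, spk_logits, emo_y, spk_y, w=0.5):
    # Weighted sum of per-task losses; w trades off the auxiliary task.
    ce = nn.functional.cross_entropy
    return ce(emo_logits, emo_y) + w * ce(spk_logits, spk_y)
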
“…We use the Montreal Forced Aligner (MFA) [24] to extract the forced alignment for a given audio-text pair. Consistent with [5], the forced alignment is a sequence of monophones, of which there are 72 in total. In the next phase, the content prior encoder $E_{cp}$ takes the one-hot form of the alignment sequence as input at each time step and predicts the frame-wise content prior distribution $p(z_c \mid A^{FA}_X)$.…”
Section: Acoustic Alignment as Content Condition (mentioning)
Confidence: 99%
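
A minimal sketch of this step is given below: it one-hot encodes a frame-level monophone alignment (72 classes, as stated in the excerpt) and predicts the mean and log-variance of a frame-wise Gaussian prior. The GRU encoder and the diagonal-Gaussian parameterization are assumptions for illustration, not details taken from the cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MONOPHONES = 72  # from the excerpt above

class ContentPriorEncoder(nn.Module):
    """Sketch of a content prior encoder E_cp: consumes a one-hot
    frame-level monophone alignment and predicts the parameters of a
    frame-wise Gaussian prior p(z_c | A_X^FA). Layer choices here are
    assumptions, not taken from the cited paper."""

    def __init__(self, latent_dim=16, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(NUM_MONOPHONES, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, alignment_ids):
        # alignment_ids: (batch, frames) integer monophone IDs from MFA
        one_hot = F.one_hot(alignment_ids, NUM_MONOPHONES).float()
        h, _ = self.rnn(one_hot)           # (batch, frames, hidden)
        return self.mu(h), self.logvar(h)  # frame-wise prior parameters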