Interspeech 2020
DOI: 10.21437/interspeech.2020-2096

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

Cited by 31 publications
(20 citation statements)
“…Secondly, inspired by [19], a fine-grained encoder is added at the decoder's tailor, which extracts variable-length detail style information from multiple reference samples via an attention mechanism. It leverages features near to raw reference audio for better generalization.…”
Section: Fastspeech-based Acoustic Model
type: mentioning
confidence: 99%
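
The mechanism named in the citation statement above, attention over frames pooled from several reference samples of differing length, can be illustrated with a small sketch. This is not the implementation from Attentron or from the citing paper: the function names, the toy two-dimensional frames, and the choice of plain scaled dot-product attention with a single query are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, reference_clips):
    """Scaled dot-product attention of one query over all reference frames.

    query: list of d floats (a hypothetical decoder state).
    reference_clips: list of clips; each clip is a list of d-dimensional
        frames, and clips may differ in length.
    Returns (context, weights): context is a d-dimensional weighted average
    of the frames, and weights has one entry per pooled frame.
    """
    # Pool frames from every clip, so the memory length varies with the input.
    frames = [frame for clip in reference_clips for frame in clip]
    if not frames:
        raise ValueError("need at least one reference frame")
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return context, weights


# Two hypothetical reference clips of different lengths (2 and 3 frames).
clips = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]],
]
context, weights = attend([1.0, 0.0], clips)
print(len(weights), context)
```

Because the frames are pooled before attention is applied, the number of weights follows the total length of the reference audio rather than a fixed embedding size, which is the sense in which the embedding is variable-length.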
“…In speaker adaptation, all or part of the TTS model is fine-tuned to a small number of audio samples from the unseen speaker [3,4,5,6]. In the speaker encoder approach, embeddings are inferred by an encoder network embedded directly in the TTS model [7,8], or using an auxiliary encoder trained on a large amount of audio-only data [4,9,10]. The latter may be trained on a speaker-discriminative objective [11,12], or on a voice-conversion task [13].…”
Section: Background and Related Work
type: mentioning
confidence: 99%
“…Inspired by recent speech adaptation methods [14,15] at different granularities, in this paper, we introduce the multihead attention-based multi-reference encoder to the zero-shot multi-speaker singing voice synthesis system. Namely, the multi-reference encoder based singing voice synthesis (MR-SVS) system.…”
Section: Introduction
type: mentioning
confidence: 99%