Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

Choi, Sukgeun; Han, Seungju; Kim, Dong‐Young; Ha, Sungjoo

doi:10.21437/interspeech.2020-2096

Cited by 31 publications

(20 citation statements)

References 0 publications

“…Secondly, inspired by [19], a fine-grained encoder is added at the decoder's tail, which extracts variable-length detail style information from multiple reference samples via an attention mechanism. It leverages features close to the raw reference audio for better generalization.…”

Section: FastSpeech-based Acoustic Model (mentioning)

confidence: 99%

The ThinkIT System for ICASSP2021 M2VoC Challenge

Shang, Zhang, Chen, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this paper, we introduce the low-resource text-to-speech system that the ThinkIT team submitted to the Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC). The challenge has two tasks: the few-shot track 1 provides 100 samples for each person, and the one-shot track 2 offers only 5 samples. Each track contains two sub-tracks, A and B. Unlike sub-track A, sub-track B allows extra public data in addition to the released data. We participate in sub-track A only. We choose fine-tuning as our backbone strategy. Our submitted systems include a BERT-based prosody boundary prediction module, a FastSpeech-based acoustic model that generates acoustic features from text input, and a HiFi-GAN-based vocoder that generates waveforms from acoustic features. Among them, the acoustic model is susceptible to over-fitting on low-resource speakers. To prevent over-fitting, we modified the acoustic model and split out a validation set to assist manual model selection. Evaluation results provided by the challenge organizers demonstrate the effectiveness of our system.

“…In speaker adaptation, all or part of the TTS model is fine-tuned to a small number of audio samples from the unseen speaker [3,4,5,6]. In the speaker encoder approach, embeddings are inferred by an encoder network embedded directly in the TTS model [7,8], or using an auxiliary encoder trained on a large amount of audio-only data [4,9,10]. The latter may be trained on a speaker-discriminative objective [11,12], or on a voice-conversion task [13].…”

Section: Background and Related Work (mentioning)

confidence: 99%

Speaker Generation

Stanton, Shannon, Mariooryad, et al. 2021

Preprint

This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.

“…Inspired by recent speech adaptation methods [14,15] at different granularities, in this paper, we introduce the multihead attention-based multi-reference encoder to the zero-shot multi-speaker singing voice synthesis system. Namely, the multi-reference encoder based singing voice synthesis (MR-SVS) system.…”

Section: Introduction (mentioning)

confidence: 99%

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Wang, Liu, Ren, et al. 2022

Preprint

Multi-speaker singing voice synthesis aims to generate singing voices sung by different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker with a fixed-size embedding from a single reference audio. However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture the full details of the target timbre; 2) a single reference audio does not contain sufficient timbre information about the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice. In this paper, we propose a new model called MR-SVS to tackle these problems. Specifically, we employ both a multi-reference encoder and a fixed-size encoder to encode the timbre of the target speaker from multiple reference audios. The multi-reference encoder can capture more details and variations of the target timbre. In addition, we propose a pitch shift method to address the pitch inconsistency problem. Experiments indicate that our method outperforms the baseline method in both naturalness and similarity.
