2022
DOI: 10.1016/j.specom.2021.12.002

Investigating a neural all pass warp in modern TTS applications

Citations: cited by 5 publications (5 citation statements)
References: 17 publications
“…These variables make it challenging for the TTS system to adapt to every individual's voice [9]. (3) Tendency to Overfit - Overtraining a TTS system based on training data from specific speaker(s) can result in overfitting, where the TTS system becomes too specialized to the training data and performs poorly on new data, i.e., a new speaker's voice or new or rare vocabulary [10]. (4) Challenges in Fine-tuning - Being able to fine-tune a TTS system to a specific speaker's voice can be challenging because it requires adjusting the parameters of an initial model.…”
Section: Adapting To a Specific Speaker's Voice
Citation type: mentioning
confidence: 99%
“…Zero-shot attempts to customize target speech from an unseen target speaker's speech by extracting a speaker embedding from the original target speaker's dataset without using parameters. Investigations indicate improved speaker similarity and demonstrate that the neural all-pass warp (APW) using Tacotron2 (encoder-decoder architecture) raises the generalizability of a multi-speaker model with a zero-shot speaker adaptation [53]. However, the zero-shot method usually suffers from inadequate speaker similarity.…”
Section: Speaker Adaptation
Citation type: mentioning
confidence: 99%
“…Kumar N et al presented a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) [25]. Compared to the normalization architecture, ZSM-SS added non-autoregressive multi-head attention between the encoder-decoder architecture [26][27][28].…”
Section: TTS
Citation type: mentioning
confidence: 99%