2021
DOI: 10.48550/arxiv.2106.15561
Preprint

A Survey on Neural Speech Synthesis

Xu Tan, Tao Qin, Frank Soong, et al.

Abstract: Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current…

Cited by 60 publications (85 citation statements)
References 286 publications (782 reference statements)
“…This was done because although the latest works in ZS-TTS [3,4,10] only use the VCTK dataset, this dataset has a limited number of speakers (109) and little variety of recording conditions. Thus, after training with only this dataset, in general, ZS-TTS models do not generalize satisfactorily to new speakers where recording conditions or voice characteristics are very different than those seen in the training [12].…”
Section: Methods (mentioning)
confidence: 93%
“…The different recording conditions are a challenge for the generalization of the zero-shot multi-speaker TTS models. In addition, speakers who have a voice that differs greatly from those seen in training also become a challenge [12]. Nevertheless, to show the potential of our model for adaptation to new speakers/recording conditions, we selected from 20 to 61 seconds of speech for 2 speakers (1M/1F) from Portuguese and the same for English in the Common Voice [37] dataset.…”
Section: Speaker Adaptation (mentioning)
confidence: 99%
“…Considering the aforementioned benefits, TTS is undoubtedly an essential speech processing technology for any language. In recent years, TTS research has progressed remarkably thanks to neural network-based architectures (Tan et al., 2021), regularly organized challenges (Black and Tokuda, 2005; Dunbar et al., 2019), and open-source datasets (Ito and Johnson, 2017; Zen et al., 2019; Shi et al., 2020). Especially, impressive results have been achieved for commercially viable languages, such as English and Mandarin.…”
Section: Introduction (mentioning)
confidence: 99%