2022
DOI: 10.48550/arxiv.2205.04421
Preprint

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level qu…

Cited by 13 publications (25 citation statements)
References 27 publications
“…Different from text to speech synthesis that mainly generates mono speech from text [37,38], binaural audio synthesis aims to convert mono audio into its binaural version. Based on the physical process of sound rendering, human listening can be generally considered as a source-medium-receiver model [3].…”
Section: Binaural Audio Synthesis
Citation type: mentioning
confidence: 99%
“…At inference time, we used the drum VQ decoder to convert the drum codes $\{z^d_t\}$ to a Mel-spectrogram, which is then turned into the waveform of the drum clip $x^d$ by a HiFi-GAN V1 vocoder [42]. We trained the vocoder from scratch with audio of drum sounds from our dataset for 2.5 days, and then, inspired by [50,51], fine-tuned it on the reconstructed Mel-spectrograms of the Drum VQ decoder.…”
Section: Experiments Setup
Citation type: mentioning
confidence: 99%
“…Typical machine learning tasks in the fields of natural language processing [49,13,77,18,8], speech [2,34,51,79,71], computer vision [25,69,45,33,28], etc., usually handle a mapping from source data X to target data Y. For example, X is an image and Y is a class label in image classification [17]; X is a style tag and Y is a sentence in style-controlled text generation [50]; X is text and Y is speech in text-to-speech synthesis [70,71].…”
Section: Introduction 1 Data Understanding and Generation
Citation type: mentioning
confidence: 99%
“…Depending on the relative amount of information that X and Y contain, these mappings can be divided into data understanding [45,18], data generation [28,8], and the combination of data understanding and generation [1,31,29,10,71]. Figure 1 shows the three types of tasks and the relative information between X and Y: • Data understanding tasks, in which X contains much more information than Y (e.g., image classification [17,45], object detection [27,60], sentence classification [90], machine reading comprehension [55]).…”
Section: Introduction 1 Data Understanding and Generation
Citation type: mentioning
confidence: 99%