FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Preprint, 2020
DOI: 10.48550/arxiv.2006.04558

Abstract: Advanced text to speech (TTS) models such as FastSpeech [20] can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech h…
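The duration-driven design the abstract refers to centers on FastSpeech's length regulator, which upsamples phoneme-level hidden states to frame level using predicted durations. A minimal sketch, assuming NumPy arrays; the function name and shapes are illustrative, not the paper's actual implementation:

```python
import numpy as np

def length_regulate(hidden, durations):
    """Expand phoneme-level hidden states to frame level by repeating
    each state durations[i] times (the core idea of FastSpeech's
    length regulator)."""
    return np.repeat(hidden, durations, axis=0)

# Three phoneme states, each a 2-dim vector (toy example).
hidden = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
durations = np.array([2, 1, 3])  # predicted frames per phoneme
frames = length_regulate(hidden, durations)
print(frames.shape)  # (6, 2): total frames = sum of durations
```

The duration predictor (trained on teacher-model alignments in FastSpeech, and on ground-truth durations in FastSpeech 2) supplies the `durations` vector at inference time.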

Cited by 153 publications (237 citation statements)
References 23 publications
“…These results indicate the effectiveness of Conformer blocks and the Mel-based adversarial training method for singing voice synthesis. From the generated results, the baseline models showed limited capability in handling pitches that fall in the long tail of the training set's pitch-frequency distribution. How to deal with this issue raised by the long-tail distribution, and even how to perform a song with note pitches beyond the training set's pitch distribution, should be considered in future work.…”
Section: Results (mentioning)
confidence: 99%
“…A typical two-stage singing voice synthesis framework, which consists of an acoustic model and a vocoder, is adopted in the experiments. In practice, FastSpeech 2 [5] and HiFi-GAN [11], two popular models for spectrogram synthesis and waveform reconstruction respectively, are utilized in this work.…”
Section: Methods (mentioning)
confidence: 99%
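The two-stage pipeline this excerpt describes (acoustic model producing a mel-spectrogram, then a vocoder producing the waveform) can be sketched with stand-in stubs; the interfaces below are hypothetical for illustration, not the real FastSpeech 2 or HiFi-GAN APIs:

```python
def acoustic_model(phonemes, f0, durations):
    """Stand-in for a FastSpeech 2-style acoustic model: maps symbolic
    input plus pitch/duration conditioning to a mel-spectrogram
    (here a dummy list of 80-band frames)."""
    n_frames, n_mels = sum(durations), 80
    return [[0.0] * n_mels for _ in range(n_frames)]

def vocoder(mel):
    """Stand-in for a HiFi-GAN-style vocoder: mel-spectrogram to
    waveform samples (a hop size of 256 is assumed for illustration)."""
    hop = 256
    return [0.0] * (len(mel) * hop)

# Chain the two stages: text-side features -> mel -> waveform.
mel = acoustic_model(["HH", "AH", "L"], f0=[220.0] * 3, durations=[3, 5, 4])
wav = vocoder(mel)
print(len(mel), len(wav))  # 12 frames -> 3072 samples
```

The key design point of the two-stage split is that each stage can be trained and swapped independently, which is why the cited work pairs FastSpeech 2 with HiFi-GAN.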
“…Apart from the original spectrogram, deep learning architectures have also been applied to non-linear spectrograms such as mel-spectrograms [21] [22] [23] [24] [25] [26] [27] or the Constant-Q Transform (CQT) [28]. The mel-spectrogram is generated by applying perceptual filters, called mel filter banks, to the DFT.…”
Section: B. Spectrograms (mentioning)
confidence: 99%
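The mel filter banks this excerpt mentions can be constructed directly: triangular filters spaced evenly on the mel scale, applied to a power spectrum. A minimal NumPy sketch using the standard HTK mel formula; the sample rate, FFT size, and band count are illustrative defaults:

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK mel-scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=16000, n_fft=512, n_mels=40):
    """Build n_mels triangular filters over n_fft//2 + 1 DFT bins,
    with centers spaced evenly on the mel scale from 0 Hz to Nyquist.
    Multiplying a power spectrogram by this matrix gives a
    mel-spectrogram."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):   # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filter_bank()
print(fb.shape)  # (40, 257)
```

In practice libraries such as librosa provide equivalent (and more configurable) filter-bank construction; this sketch only shows the mechanism.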
“…We have attached the modified FastSpeech 2 in Appendix F in the supplementary materials. During training, the configuration follows prior work [31]. Since the F0 and duration are usually known in singing voice synthesis, we remove the…”
Table 7: The MOS results with 95% confidence intervals on each singing voice synthesis system.
Section: Singing Voice Synthesis System (mentioning)
confidence: 99%