FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Preprint, 2020
DOI: 10.48550/arxiv.2006.04558

Abstract: Advanced text to speech (TTS) models such as FastSpeech [20] can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech h…
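The duration-driven design the abstract refers to centers on FastSpeech's length regulator, which upsamples phoneme-level hidden states to frame level using predicted durations. A minimal sketch, assuming NumPy arrays; the function name and shapes are illustrative, not the paper's actual implementation:

```python
import numpy as np

def length_regulate(hidden, durations):
    """Expand phoneme-level hidden states to frame level by repeating
    each state durations[i] times (the core idea of FastSpeech's
    length regulator)."""
    return np.repeat(hidden, durations, axis=0)

# Three phoneme states, each a 2-dim vector (toy example).
hidden = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
durations = np.array([2, 1, 3])  # predicted frames per phoneme
frames = length_regulate(hidden, durations)
print(frames.shape)  # (6, 2): total frames = sum of durations
```

The duration predictor (trained on teacher-model alignments in FastSpeech, and on ground-truth durations in FastSpeech 2) supplies the `durations` vector at inference time.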

Cited by 153 publications (237 citation statements)
References 23 publications
“…These results indicate the effectiveness of Conformer blocks and the Mel-based adversarial training method for singing voice synthesis. From the generated results, the baseline models showed limited capability in handling pitches that fall in the long tail of the training set's pitch-frequency distribution. How to deal with this issue raised by the long-tail distribution, and even how to perform a song with note pitches beyond the training set's pitch distribution, should be considered in future work.…”
Section: Results (mentioning)
confidence: 99%
“…A typical two-stage singing voice synthesis framework, which consists of an acoustic model and a vocoder, is adopted in the experiments. In practice, FastSpeech 2 [5] and HiFi-GAN [11], two popular models for spectrogram synthesis and waveform reconstruction respectively, are utilized in this work.…”
Section: Methods (mentioning)
confidence: 99%
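The two-stage pipeline this excerpt describes (acoustic model producing a mel-spectrogram, then a vocoder producing the waveform) can be sketched with stand-in stubs; the interfaces below are hypothetical for illustration, not the real FastSpeech 2 or HiFi-GAN APIs:

```python
def acoustic_model(phonemes, f0, durations):
    """Stand-in for a FastSpeech 2-style acoustic model: maps symbolic
    input plus pitch/duration conditioning to a mel-spectrogram
    (here a dummy list of 80-band frames)."""
    n_frames, n_mels = sum(durations), 80
    return [[0.0] * n_mels for _ in range(n_frames)]

def vocoder(mel):
    """Stand-in for a HiFi-GAN-style vocoder: mel-spectrogram to
    waveform samples (a hop size of 256 is assumed for illustration)."""
    hop = 256
    return [0.0] * (len(mel) * hop)

# Chain the two stages: text-side features -> mel -> waveform.
mel = acoustic_model(["HH", "AH", "L"], f0=[220.0] * 3, durations=[3, 5, 4])
wav = vocoder(mel)
print(len(mel), len(wav))  # 12 frames -> 3072 samples
```

The key design point of the two-stage split is that each stage can be trained and swapped independently, which is why the cited work pairs FastSpeech 2 with HiFi-GAN.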
“…Apart from the original spectrogram, deep learning architectures have also been applied to non-linear spectrograms such as mel-spectrograms [21] [22] [23] [24] [25] [26] [27] or the Constant-Q Transform (CQT) [28]. The mel-spectrogram is generated by applying perceptual filters, called mel filter banks, to the DFT.…”
Section: B. Spectrograms (mentioning)
confidence: 99%
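The mel filter banks this excerpt mentions can be constructed directly: triangular filters spaced evenly on the mel scale, applied to a power spectrum. A minimal NumPy sketch using the standard HTK mel formula; the sample rate, FFT size, and band count are illustrative defaults:

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK mel-scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=16000, n_fft=512, n_mels=40):
    """Build n_mels triangular filters over n_fft//2 + 1 DFT bins,
    with centers spaced evenly on the mel scale from 0 Hz to Nyquist.
    Multiplying a power spectrogram by this matrix gives a
    mel-spectrogram."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):   # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filter_bank()
print(fb.shape)  # (40, 257)
```

In practice libraries such as librosa provide equivalent (and more configurable) filter-bank construction; this sketch only shows the mechanism.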
“…We have attached the modified FastSpeech 2 in Appendix F in the supplementary materials. During training, the configuration follows prior work [31]. Since the F0 and duration are usually known in singing voice synthesis, we remove the…”
Table 7: The MOS results with 95% confidence intervals on each singing voice synthesis system.
Section: Singing Voice Synthesis System (mentioning)
confidence: 99%