NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Tan, Xu; Chen, Jiawei; Liu, Haohe; Cong, Jian; Zhang, Chen; Liu, Yanqing; Wang, Xi; Leng, Yichong; Yi, Yuanhao; He, Lei; Soong, Frank K.; Qin, Tao; Zhao, Sheng; Liu, Tie-Yan

doi:10.48550/arxiv.2205.04421

Cited by 13 publications

(25 citation statements)

References 27 publications

“…Unlike text-to-speech synthesis, which mainly generates mono speech from text [37,38], binaural audio synthesis aims to convert mono audio into its binaural version. Based on the physical process of sound rendering, human listening can generally be considered a source-medium-receiver model [3].…”

Section: Binaural Audio Synthesis (mentioning)

confidence: 99%

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Leng, Chen, Guo et al., 2022. Preprint. Self-citation.

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head- and ear-related filtering, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio; based on this, the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples are available online.
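
The decomposition described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes, for illustration only, that the common part is the per-sample average of the two channels and that each specific part is the residual of a channel against that average. The function names `decompose` and `recompose` are hypothetical, and no diffusion model is involved.

```python
from typing import List, Tuple

Signal = List[float]


def decompose(left: Signal, right: Signal) -> Tuple[Signal, Signal, Signal]:
    """Split a two-channel signal into a common part and two specific parts.

    Assumption (not confirmed by the abstract): the common part is the
    per-sample mean of both channels; each specific part is the residual.
    """
    common = [(l + r) / 2 for l, r in zip(left, right)]
    specific_left = [l - c for l, c in zip(left, common)]
    specific_right = [r - c for r, c in zip(right, common)]
    return common, specific_left, specific_right


def recompose(common: Signal, specific_left: Signal,
              specific_right: Signal) -> Tuple[Signal, Signal]:
    """Rebuild both channels by adding each specific part to the common part."""
    left = [c + s for c, s in zip(common, specific_left)]
    right = [c + s for c, s in zip(common, specific_right)]
    return left, right


left_channel = [1.0, 0.5]
right_channel = [0.0, 0.5]
common, spec_l, spec_r = decompose(left_channel, right_channel)
rebuilt_left, rebuilt_right = recompose(common, spec_l, spec_r)
```

In the two-stage framing, a first model would only need to predict something like `common` from the mono input, and a second model would then fill in the channel-specific detail.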

“…At inference time, we used the drum VQ decoder to convert the drum codes $\{z^d_t\}$ to a Mel-spectrogram, which is then turned into the waveform of the drum clip $x^d$ by a HiFi-GAN V1 vocoder [42]. We trained the vocoder from scratch with audio of drum sounds from our dataset for 2.5 days, and then, inspired by [50,51], fine-tuned it on the reconstructed Mel-spectrograms of the Drum VQ decoder.…”

Section: Experiments Setup (mentioning)

confidence: 99%

JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE

Wu, Chiu, Yang, 2022. Preprint.

This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quantized variational autoencoder (VQ-VAE) to represent the input audio with discrete codes, which can then be readily used in a Transformer. Second, using an audio-domain beat tracking model, we compute beat-related features of the input audio and use them as embeddings in the Transformer. Instead of generating the drum track directly as waveforms, we use a separate VQ-VAE to encode the mel-spectrogram of a drum track into another set of discrete codes, and train the Transformer to predict the sequence of drum-related discrete codes. The output codes are then converted to a mel-spectrogram with a decoder, and then to the waveform with a vocoder. We report both objective and subjective evaluations of variants of the proposed model, demonstrating that the model with beat information generates drum accompaniment that is rhythmically and stylistically consistent with the input audio.

“…Typical machine learning tasks in natural language processing [49,13,77,18,8], speech [2,34,51,79,71], computer vision [25,69,45,33,28], etc., usually handle a mapping from source data X to target data Y. For example, X is an image and Y is a class label in image classification [17]; X is a style tag and Y is a sentence in style-controlled text generation [50]; X is text and Y is speech in text-to-speech synthesis [70,71].…”

Section: Introduction 1 Data Understanding and Generation (mentioning)

confidence: 99%

“…Depending on the relative amount of information that X and Y contain, these mappings can be divided into data understanding [45,18], data generation [28,8], and the combination of data understanding and generation [1,31,29,10,71]. Figure 1 shows the three types of tasks and the relative information between X and Y: • Data understanding tasks, in which X contains much more information than Y (e.g., image classification [17,45], object detection [27,60], sentence classification [90], machine reading comprehension [55]).…”

Section: Introduction 1 Data Understanding and Generation (mentioning)

confidence: 99%

“…• Data understanding/generation tasks, in which X contains no significantly more or less information than Y (e.g., image transfer [91], text-to-image synthesis [57,58,86,64,11], neural machine translation [1,31], text-to-speech synthesis [71,70], automatic speech recognition [34]). In this case, we need both data understanding capability on the source X and data generation capability on the target Y.…”

Section: Introduction 1 Data Understanding and Generation (mentioning)

confidence: 99%

Regeneration Learning: A Learning Paradigm for Data Generation

Tan, Qin, Bian et al., 2023. Preprint.

Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in the source data, which hinders effective and efficient learning of the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y′ (an abstraction/representation of Y) from X and then generates Y from Y′. During training, Y′ is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X → Y′ and Y′ → Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y′) of the target data Y for data generation while traditional representation learning handles the abstraction (X′) of source data X for data understanding; 2) both the processes of Y → Y′ in regeneration learning and X → X′ in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y′ in regeneration learning and from X′ to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
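
The two-step structure in this abstract (X → Y′, then Y′ → Y) can be sketched as a simple function composition. This is a toy illustration under stated assumptions, not the paper's method: the handcrafted rule `abstract_target` (block averaging), the stand-in stages `toy_stage1` and `toy_stage2`, and the helper `regenerate` are all hypothetical, and in practice both stages would be learned models.

```python
from typing import Callable, List

Sequence = List[float]


def abstract_target(y: Sequence, factor: int = 2) -> Sequence:
    """Handcrafted rule Y -> Y': average each block of `factor` samples.

    Assumption: block averaging stands in for whatever rule or
    self-supervised encoder produces the abstraction Y'.
    """
    blocks = [y[i:i + factor] for i in range(0, len(y), factor)]
    return [sum(block) / len(block) for block in blocks]


def regenerate(x: str,
               stage1: Callable[[str], Sequence],
               stage2: Callable[[Sequence], Sequence]) -> Sequence:
    """Generate Y from X in two steps: X -> Y' (stage1), then Y' -> Y (stage2)."""
    y_prime = stage1(x)
    return stage2(y_prime)


def toy_stage1(x: str) -> Sequence:
    """Stand-in for a learned X -> Y' model: one value per word (its length)."""
    return [float(len(word)) for word in x.split()]


def toy_stage2(y_prime: Sequence) -> Sequence:
    """Stand-in for a learned Y' -> Y model: repeat each value twice."""
    return [value for value in y_prime for _ in range(2)]


coarse = abstract_target([1.0, 3.0, 5.0, 7.0])
generated = regenerate("ab cde", toy_stage1, toy_stage2)
```

During training, `abstract_target` would supply the Y′ targets for the first stage and the inputs for the second; at generation time only `regenerate` is used.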
