2021 IEEE Spoken Language Technology Workshop (SLT) 2021
DOI: 10.1109/slt48900.2021.9383551

Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Abstract: In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both…
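
The abstract names a multi-resolution STFT loss without giving its form. The following is a minimal NumPy sketch of one common formulation, assumed here rather than taken from the paper: a spectral-convergence term plus a log-magnitude L1 term, averaged over several STFT resolutions. The function names and the (FFT size, hop, window length) triples are illustrative assumptions.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_len):
    """Magnitude spectrogram with a Hann window (no padding at the edges)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # rfft zero-pads each frame from win_len up to fft_size.
    spec = np.fft.rfft(frames, n=fft_size, axis=1)
    # Floor the magnitude so the logarithm below is always defined.
    return np.maximum(np.abs(spec), 1e-7)


def stft_loss(real, fake, fft_size, hop, win_len):
    """Spectral convergence plus log-magnitude L1 at a single resolution."""
    s_real = stft_magnitude(real, fft_size, hop, win_len)
    s_fake = stft_magnitude(fake, fft_size, hop, win_len)
    spectral_convergence = np.linalg.norm(s_real - s_fake) / np.linalg.norm(s_real)
    log_magnitude = np.mean(np.abs(np.log(s_real) - np.log(s_fake)))
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(
    real,
    fake,
    # Assumed (fft_size, hop, win_len) settings, chosen for illustration only.
    resolutions=((1024, 120, 600), (2048, 240, 1200), (512, 50, 240)),
):
    """Average the single-resolution loss over all resolutions."""
    return float(np.mean([stft_loss(real, fake, *r) for r in resolutions]))
```

Comparing a waveform with itself gives a loss of zero, and any mismatch in the magnitude spectra at any resolution raises it. In training, the same computation would be written in a differentiable framework so gradients can reach the generator.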

Cited by 132 publications (77 citation statements)
References 23 publications
“…Human hearing is highly sensitive to irregularities and discontinuities in the periodic nature of any audio [9], which makes it even harder for any NN to generate human-like speech audio. However, in recent years, researchers have achieved extraordinary success using neural networks for generating speech audio from text [15]. Most of these successes are dominated by the autoregressive models.…”
Section: A Audio Generation (mentioning)
confidence: 99%
“…The authors of [20]-[22] incorporated a flow-based generative model based on Glow [42], that can be directly learned by minimizing the negative log-likelihood of data without a distillation process. Another group of non-AR models [25]-[29] is based on an adversarial training framework [43]. Combining adversarial loss and auxiliary loss, such as multi-resolution short-time Fourier transform (STFT) loss and feature matching loss, enables non-AR models to be learned efficiently.…”
Section: Neural Waveform Generative Models (mentioning)
confidence: 99%
“…However, due to the high computational cost and time-consumed generation, real-time applications could become challenges. Researchers have adopted multi-band generation such as Multiband-WaveRNN [42], Multiband-MelGAN [40] to speed up waveform modeling. Related multi-band vocoders generate each sub-band of waveform, and then conduct bands splicing using Pseudo Quadrature Mirror Filter Bank (PQMF) [24].…”
Section: Multi-band Generation (mentioning)
confidence: 99%
“…Further, although several singing voice datasets such as MIR-1K dataset [15] and JukeBox [7] have been released for research purposes, but the corpora are not so large as expected for multiple tasks. 2) Several parallel generation methods [40,42] have been proposed to speed up waveforms synthesis. However, existing multi-band architectures do not consider characteristic differences among frequency bands, so a powerful frequency-adapted multi-band technique is required.…”
Section: Introduction (mentioning)
confidence: 99%