2018
DOI: 10.48550/arxiv.1807.07281
Preprint

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Abstract: In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van den Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural archite…
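
The closed-form KL divergence mentioned in the abstract is, at its core, the standard identity for two Gaussians. Below is a minimal sketch, assuming univariate Gaussians parameterized by mean and log scale; the function name is illustrative, and the regularized variant the abstract refers to adds a term that is not shown here.

```python
import math

def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) for univariate Gaussians
    q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    sigma_q = math.exp(log_sigma_q)
    sigma_p = math.exp(log_sigma_p)
    return (log_sigma_p - log_sigma_q
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Example: a sharply peaked student prediction against a teacher prediction.
print(gaussian_kl(mu_q=0.0, log_sigma_q=-5.0, mu_p=0.1, log_sigma_p=-4.0))
```

Because both distributions are Gaussian, no Monte Carlo sampling is needed to estimate the divergence, which is what makes the distillation objective cheap to evaluate.
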

Cited by 111 publications (65 citation statements)
References 14 publications (16 reference statements)
“…ClariNet [69] is also a vocoder that employs knowledge distillation [36]. However, the training process with distillation-based methods remains problematic.…”
Section: Vocoders
confidence: 99%
“…Autoregressive: WaveNet [66], SampleRNN [57], DeepVoice [2], LPCNet [89]; Non-autoregressive: WaveGlow [72], FloWaveNet [41], WaveFlow [70], Parallel WaveNet [65], ClariNet [69], WaveGAN [20], Parallel WaveGAN [103], MelGAN [45], GAN-TTS [5], HiFi-GAN [44]; End-to-End: Char2Wav [87], Fastspeech 2s [80], EATs [21], VITS [40]. Figure 2: A taxonomy of TTS.…”
Section: Acoustic Models
confidence: 99%
“…The second stage is to synthesize the raw waveform audio from the predicted intermediate representation [17], [18], [19], [20], [21]. In order to simplify the TTS system in terms of training and deployment, end-to-end TTS models have been proposed [22], [23], [24]. However, for the talking head generation task, the intermediate representations of the two-stage approach are useful.…”
Section: A Text-to-speech Synthesis
confidence: 99%
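
The statement above contrasts two-stage pipelines with end-to-end (text-to-wave) models. A minimal sketch of the two-stage flow, assuming a hypothetical AcousticModel/Vocoder interface; the class names, shapes, and hop length below are illustrative assumptions, not any cited system's API.

```python
import numpy as np

class AcousticModel:
    """Stage 1: text -> intermediate representation (e.g. a mel spectrogram)."""
    def predict_mel(self, text: str) -> np.ndarray:
        # Placeholder for a real sequence-to-sequence acoustic model.
        return np.zeros((80, 200), dtype=np.float32)  # 80 mel bins x 200 frames

class Vocoder:
    """Stage 2: intermediate representation -> raw waveform."""
    def synthesize(self, mel: np.ndarray) -> np.ndarray:
        # Placeholder for a neural vocoder (WaveNet-style, flow-based, or GAN).
        hop_length = 256
        return np.zeros(mel.shape[1] * hop_length, dtype=np.float32)

mel = AcousticModel().predict_mel("Hello world")
wav = Vocoder().synthesize(mel)  # end-to-end models collapse these two stages
```

The intermediate mel representation is exactly what the quoted work wants to keep for talking-head generation, whereas end-to-end models trade it away for a simpler training and deployment story.
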
“…In addition, other recent AR models, including sampleRNN [17] and LPCNet [33] have further improved the sound quality. However, due to the large amount of computation and the slow generation speed, researchers currently mainly focus on developing non-AR wave generation models, such as Parallel WaveNet [20], ClariNet [21], GanSynth [5], FloWaveNet [11], MelGan [15], WaveGlow [24], Parallel WaveGan [37], and so on.…”
Section: Introduction
confidence: 99%