Interspeech 2019
DOI: 10.21437/interspeech.2019-1424
Towards Achieving Robust Universal Neural Vocoding

Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario, when the recording conditions are studio-quality. When the recordings show significant changes in qu…


Cited by 67 publications (44 citation statements).
References 25 publications (33 reference statements).
“…4) Recent Progress on Neural Vocoders: More recently, the speaker-independent WaveRNN-based neural vocoder [63] has become popular, as it can generate human-like voices from both in-domain and out-of-domain spectrograms [101]–[103]. Another well-known neural vocoder that achieves high-quality synthesis is WaveGlow [64].…”
Section: A. Speech Analysis and Reconstruction
confidence: 99%
“…Neural vocoders such as WaveNet [62] have rapidly become the most commonly used vocoding method for speech synthesis. Although WaveNet improved the quality of generated speech, it incurs significant computational and data costs and suffers from poor generalization [50]. To address this, architectures such as Wave Recurrent Neural Networks (WaveRNN) [36] have been proposed.…”
Section: DDF for Speech Representation
confidence: 99%
“…WaveRNN combines linear prediction with recurrent neural networks to synthesize audio much faster than other neural synthesizers. In our framework, we use WaveRNN as a decoder with a minor change suggested by [50]. The autoregressive component consists of a single forward gated recurrent unit (GRU, hidden size 896) and a pair of affine layers followed by a softmax layer with 1024 outputs, predicting 10-bit mu-law samples at a 24 kHz sampling rate.…”
Section: DDF for Speech Representation
confidence: 99%
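The 10-bit mu-law target mentioned above can be illustrated with a minimal sketch. This uses the standard mu-law companding formula with mu = 1023 (giving the 1024 softmax classes); the exact quantizer used by the cited vocoder may differ, so treat this as an assumption:

```python
import math

MU = 1023  # 10-bit mu-law: 2**10 - 1, matching the 1024 softmax outputs

def mulaw_encode(x: float) -> int:
    """Compand a waveform sample in [-1, 1] and quantize to a class in [0, 1023]."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)  # companded, in [-1, 1]
    return int((y + 1.0) / 2.0 * MU + 0.5)  # round to an integer bin

def mulaw_decode(q: int) -> float:
    """Invert the quantization back to a waveform sample in [-1, 1]."""
    y = 2.0 * q / MU - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Companding before quantization spends more of the 1024 levels on quiet samples, which is why autoregressive vocoders can get away with a 10-bit categorical output instead of raw 16-bit PCM.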
“…Model weights are tuned with the Adam optimizer to minimize the teacher-forced L1 loss between predicted and extracted mel-spectrograms. To complete the TTS pipeline, we convert mel-spectrograms to waveforms using the multi-speaker neural vocoder of [19]. This vocoder is trained across 74 speakers and is suitable for generating natural speech for our wide range of adaptation speakers.…”
Section: Base Multi-Speaker Model
confidence: 99%
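The teacher-forced L1 objective above is simply the mean absolute error between predicted and ground-truth mel-spectrogram frames; under teacher forcing the decoder sees the true previous frame, so the two tensors are frame-aligned. A minimal NumPy sketch (the frame count and 80 mel bins here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and extracted mel-spectrograms."""
    return float(np.mean(np.abs(pred - target)))

# Hypothetical frame-aligned spectrograms of shape (frames, mel_bins).
target = np.ones((100, 80))  # extracted from the recording
pred = np.zeros((100, 80))   # decoder output under teacher forcing
loss = l1_loss(pred, target)  # every bin is off by 1, so the mean is 1.0
```

In practice this scalar is minimized with Adam over minibatches; L1 is preferred over L2 here because it is less prone to over-smoothed spectrogram predictions.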