2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021
DOI: 10.1109/asru51503.2021.9688154
On-Device Neural Speech Synthesis

Cited by 8 publications (4 citation statements)
References 7 publications
“…We trained the models for 300k steps using 16 GPUs and a batch size of 512. We use WaveRNN [41,43] to generate speech from the Mel-spectrograms, trained separately for each speaker. The [M −3σ, M +3σ] spectral tilt values for Voice 1 and 2 are [−0.984, −0.926] and [−0.990, −0.931], respectively.…”
Section: Model Training
confidence: 99%
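The [M −3σ, M +3σ] interval quoted above is the sample mean of the per-utterance spectral tilt measurements plus or minus three standard deviations. A minimal sketch of that arithmetic, using made-up tilt values (the `tilts` list is hypothetical, not data from the paper):

```python
import statistics

# Hypothetical per-utterance spectral tilt measurements (illustrative only).
tilts = [-0.97, -0.95, -0.96, -0.94, -0.955, -0.965, -0.95, -0.96]

m = statistics.mean(tilts)        # M
sigma = statistics.stdev(tilts)   # sample standard deviation

# The excerpt reports the range as [M - 3*sigma, M + 3*sigma].
lo, hi = m - 3 * sigma, m + 3 * sigma
print(f"[{lo:.3f}, {hi:.3f}]")
```

With real data, `tilts` would hold one spectral-tilt estimate per utterance for a given voice, and the printed interval would correspond to the bracketed ranges in the excerpt.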
“…More information about the architecture and the on-device implementation of the baseline system can be found in [19].…”
Section: Technical Overview
confidence: 99%
“…We train all the models for 3 million steps using a single GPU and a batch size of 16. All systems use the same back-end WaveRNN model [19], trained with the 36-hour dataset, to generate speech from the Mel-spectrograms.…”
Section: Models
confidence: hi
“…Recent attempts to build on-device neural TTS include On-device TTS [7], LiteTTS [8], PortaSpeech [9], LightSpeech [10] and Nix-TTS [11]. On-device TTS is slow and resource-intensive, since it uses a modified Tacotron2 for Mel-spectrogram generation and WaveRNN as the vocoder.…”
Section: Introduction
confidence: 99%