2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021
DOI: 10.1109/asru51503.2021.9688154
On-Device Neural Speech Synthesis

Cited by 8 publications (4 citation statements)
References 7 publications
“…We trained the models for 300k steps using 16 GPUs and a batch size of 512. We use WaveRNN [41,43] to generate speech from the Mel-spectrograms, trained separately for each speaker. The [M −3σ, M +3σ] spectral tilt values for Voice 1 and 2 are [−0.984, −0.926] and [−0.990, −0.931], respectively.…”
Section: Model Training
confidence: 99%
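The [M −3σ, M +3σ] interval quoted above is the sample mean of the per-utterance spectral tilt measurements plus or minus three standard deviations. A minimal sketch of that arithmetic, using made-up tilt values (the `tilts` list is hypothetical, not data from the paper):

```python
import statistics

# Hypothetical per-utterance spectral tilt measurements (illustrative only).
tilts = [-0.97, -0.95, -0.96, -0.94, -0.955, -0.965, -0.95, -0.96]

m = statistics.mean(tilts)        # M
sigma = statistics.stdev(tilts)   # sample standard deviation

# The excerpt reports the range as [M - 3*sigma, M + 3*sigma].
lo, hi = m - 3 * sigma, m + 3 * sigma
print(f"[{lo:.3f}, {hi:.3f}]")
```

With real data, `tilts` would hold one spectral-tilt estimate per utterance for a given voice, and the printed interval would correspond to the bracketed ranges in the excerpt.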
“…More information about the architecture and the on-device implementation of the baseline system can be found in [19].…”
Section: Technical Overview
confidence: 99%
“…We train all the models for 3 million steps using a single GPU and a batch size of 16. All systems use the same back-end WaveRNN model [19], trained with the 36-hour dataset, to generate speech from the Mel-spectrograms.…”
Section: Models
confidence: hi
“…Recent attempts to build on-device neural TTS include On-device TTS [7], LiteTTS [8], PortaSpeech [9], LightSpeech [10] and Nix-TTS [11]. On-device TTS is slow and resource-intensive, since it uses a modified Tacotron2 for Mel-spectrogram generation and WaveRNN as the vocoder.…”
Section: Introduction
confidence: 99%