While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology that circumvents the costly operation of recording large amounts of target data, enabling expressive style voices to be built with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train a TTS model. Finally, we fine-tune that model to further increase quality. Our evaluations show that the proposed changes bring significant improvements over non-augmented models across many perceived aspects of synthesised speech. We demonstrate the proposed approach on two styles (newscaster and conversational), on various speakers, and on both single- and multi-speaker models, illustrating the robustness of our approach.
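The 3-step recipe above can be summarised as a small pipeline. The sketch below is illustrative only: the `Utterance` type and the `convert_voice`, `train_tts`, and `fine_tune` callables are hypothetical placeholders for components the abstract does not specify, so they are passed in rather than implemented.

```python
from typing import Callable, List, Sequence

# Stand-in type for an audio/text training example; purely illustrative.
Utterance = object

def build_style_voice(
    target_recordings: List[Utterance],           # ~15 min of target speech
    style_corpus: List[Utterance],                # same style, other speakers
    convert_voice: Callable[[Utterance], Utterance],
    train_tts: Callable[[Sequence[Utterance]], object],
    fine_tune: Callable[[object, Sequence[Utterance]], object],
) -> object:
    # Step 1: voice conversion turns supporting-speaker recordings
    # into synthetic data in the target speaker's voice.
    synthetic = [convert_voice(u) for u in style_corpus]
    # Step 2: train a TTS model on synthetic plus real target data.
    model = train_tts(synthetic + list(target_recordings))
    # Step 3: fine-tune on the real target recordings to raise quality.
    return fine_tune(model, target_recordings)
```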
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work [1], a 3-step method was proposed to build high-quality TTS voices while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (≈ 10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive one that replaces attention with an external duration model, and 2) an additional fine-tuning step based on a Conditional Generative Adversarial Network (cGAN).
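The two proposed improvements lend themselves to short sketches. Below, `length_regulate` shows the core of duration-driven, non-autoregressive synthesis (phoneme encodings are repeated by predicted frame counts instead of being aligned via attention), and `cgan_losses` shows one common conditional-GAN objective (least-squares) that could drive the fine-tuning step. Both are assumptions about plausible instantiations, not the paper's exact models; `disc` and `condition` are hypothetical.

```python
import torch

def length_regulate(encoder_out: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    # Upsample phoneme-level encoder outputs to frame level using the
    # per-phoneme durations from an external duration model, replacing
    # the attention alignment of the autoregressive model.
    # encoder_out: (num_phonemes, hidden); durations: (num_phonemes,) ints.
    return torch.repeat_interleave(encoder_out, durations, dim=0)

def cgan_losses(disc, real_mel, fake_mel, condition):
    # Least-squares cGAN objective (one possible choice): the
    # discriminator judges mel-spectrograms jointly with conditioning
    # features (e.g. speaker or linguistic embeddings).
    d_real = disc(real_mel, condition)
    d_fake = disc(fake_mel.detach(), condition)
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((disc(fake_mel, condition) - 1) ** 2).mean()
    return d_loss, g_loss

# Example: three phonemes with durations 2, 4 and 3 expand to 9 frames.
enc = torch.randn(3, 256)
frames = length_regulate(enc, torch.tensor([2, 4, 3]))
assert frames.shape == (9, 256)
```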