Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), 2021
DOI: 10.1145/3430984.3431034
Data-Efficient Training Strategies for Neural TTS Systems

Cited by 7 publications (4 citation statements) · References 11 publications
“…In the context of multilingual E2E training for Indian languages, [54] trains convolutional attention-based TTS with language, speaker and gender embeddings. In [56], pretraining strategies are explored between source and target languages, which enable the training of multilingual voices with a reduced amount of data. In [58], byte inputs are mapped to spectrograms and experiments are performed with 40+ languages, including Hindi, Tamil and Telugu.…”
Section: Related Work
confidence: 99%
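The pretraining strategy referenced in [56] follows the standard transfer-learning recipe for low-resource TTS: first train the acoustic model on a data-rich source language, then continue training on the small target-language corpus at a lower learning rate. The sketch below illustrates that recipe only; the TinyAcousticModel, tensor shapes, learning rates, and epoch counts are assumptions for illustration, not details from the cited papers.

```python
# Minimal sketch (not the paper's code) of pretrain-then-fine-tune transfer:
# train on a data-rich source language, then fine-tune on a small target set.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyAcousticModel(nn.Module):
    """Stand-in for a Tacotron-style text-to-spectrogram network (hypothetical)."""
    def __init__(self, vocab_size=256, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, tokens):
        x = self.embed(tokens)
        x, _ = self.encoder(x)
        return self.proj(x)  # one predicted mel frame per input token, for simplicity

def run_epochs(model, loader, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for tokens, mels in loader:
            opt.zero_grad()
            loss = loss_fn(model(tokens), mels)
            loss.backward()
            opt.step()

# Dummy tensors standing in for (phoneme IDs, mel spectrograms).
source = TensorDataset(torch.randint(0, 256, (512, 40)), torch.randn(512, 40, 80))
target = TensorDataset(torch.randint(0, 256, (32, 40)), torch.randn(32, 40, 80))

model = TinyAcousticModel()
run_epochs(model, DataLoader(source, batch_size=16), lr=1e-3, epochs=5)   # pretrain on source language
run_epochs(model, DataLoader(target, batch_size=8),  lr=1e-4, epochs=20)  # fine-tune on small target set
```

The same loop structure covers the fine-tuning of generic multilingual voices on seen languages discussed in the later excerpts; only the data and learning rate change.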
“…Going ahead, the training data per language can be further reduced to assess extreme data-stressed situations. To improve the synthesis quality of seen languages, generic voices can be further fine-tuned on seen languages, as explored in [28], [56]. Additional embeddings, such as language embeddings, can be included during training.…”
Section: Analysis of Phonotactics Across Languages
confidence: 99%
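The language embeddings mentioned in that excerpt are usually realized as a learned per-language vector that conditions the acoustic model. Below is a minimal sketch assuming a GRU text encoder and concatenation-style conditioning; the class name, dimensions, and number of languages are hypothetical, not taken from the cited work.

```python
# Illustrative sketch of adding a language embedding to an encoder (assumptions only).
import torch
import torch.nn as nn

class LanguageConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=256, hidden=128, n_languages=4, lang_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lang_embed = nn.Embedding(n_languages, lang_dim)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden + lang_dim, hidden)

    def forward(self, tokens, lang_id):
        x, _ = self.rnn(self.embed(tokens))               # (B, T, hidden) text states
        lang = self.lang_embed(lang_id)                    # (B, lang_dim) per-utterance language vector
        lang = lang.unsqueeze(1).expand(-1, x.size(1), -1) # broadcast over time
        return self.out(torch.cat([x, lang], dim=-1))      # language-aware encoder states

enc = LanguageConditionedEncoder()
states = enc(torch.randint(0, 256, (2, 40)), torch.tensor([0, 2]))  # two utterances, two languages
```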
“…Self-Supervised Training: [51,368,433,78,140,346,197,352,71]; Cross-Lingual Transfer: LRSpeech [390], [42,12,60,271,105]; Cross-Speaker Transfer: [216,125,59,39]; Speech Chain / Back Transformation: SpeechChain [344,345], LRSpeech [390,285]; Dataset Mining in the Wild: [58,119,57]; Robust (Enhancing Attention): Tacotron 2 [376], DCTTS [326], SMA [104], MultiSpeech [38], [309,297,431,326,264,262]; Replacing Attention with Duration…”
Section: Lightweight Model
confidence: 99%
“…This is mainly due to the lack of child voice datasets and difficulty in creating such datasets. As TTS models require hundreds of hours of annotated data for training [2], performing TTS for child voices can be quite challenging. The focus of this work is to explore the potential of state-of-the-art (SOTA) TTS to build a pipeline for the synthesis of children's voices with low data requirements.…”
Section: Introduction
confidence: 99%