2021
DOI: 10.1186/s13636-021-00225-4

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We…

Cited by 13 publications (6 citation statements)
References 13 publications
“…Clean Lombard speech is speech produced under the Lombard effect but without noise in the audio. In our experiment, the clean Lombard speech is synthetic Lombard speech generated by modifying the prosody of normal speech (intensity, pitch, duration) into Lombard speech using the SoX audio manipulation toolkit [41], [42]. No noise was included in the resulting audio.…”
Section: B. Training Methods
mentioning (confidence: 99%)
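
For orientation, the sketch below shows how such a prosody modification can be scripted around SoX's standard pitch, tempo, and gain effects. The effect values and file names are illustrative assumptions, not the settings reported in [41], [42].

```python
import subprocess

def make_clean_lombard(in_wav: str, out_wav: str) -> None:
    """Approximate Lombard prosody by shifting pitch, stretching duration,
    and normalizing intensity with standard SoX effects. The values below
    are illustrative, not the cited authors' settings."""
    subprocess.run(
        [
            "sox", in_wav, out_wav,
            "pitch", "150",      # raise pitch by 150 cents
            "tempo", "0.9",      # slow the speech down by ~10%
            "gain", "-n", "-3",  # normalize intensity to -3 dBFS
        ],
        check=True,
    )

make_clean_lombard("normal.wav", "clean_lombard.wav")
```

Because no noise is mixed in at any step, the output stays "clean" in the sense the authors describe.
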
“…Deep learning, however, relies heavily on a substantial quantity of training data [34], [35], to the extent that [33], [36] stated that DNNs are not a suitable technique for TTS in low-resource languages. In [37], however, techniques such as monolingual transfer learning, cross-lingual transfer learning, multi-speaker models, multilingual models, and data augmentation have been proposed as means of enabling TTS for low-resource languages.…”
Section: Text-to-Speech Translation
mentioning (confidence: 99%)
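
Of the listed techniques, data augmentation is the most direct to illustrate. The snippet below is a minimal sketch of pitch- and speed-based augmentation using librosa; the file name and parameter grids are assumptions for illustration, not values taken from [37].

```python
import librosa
import soundfile as sf

# Load one utterance from the (hypothetical) low-resource corpus.
y, sr = librosa.load("utterance.wav", sr=22050)

# Pitch variants: shift by a few semitones up and down.
for n_steps in (-2, -1, 1, 2):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(f"utterance_pitch{n_steps:+d}.wav", shifted, sr)

# Speed variants: stretch or compress duration without changing pitch.
for rate in (0.9, 1.1):
    stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(f"utterance_speed{rate}.wav", stretched, sr)
```

Each source utterance thus yields several prosodically distinct copies, multiplying the effective size of the text-speech paired corpus without new recordings.
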
“…A cross-lingual transfer learning and data augmentation approach for low-resource TTS was proposed in (Byambadorj et al., 2021). The spectrogram prediction network was trained using cross-lingual transfer learning (TL) from a high-resource language, data augmentation by varying parameters such as pitch and speed, and a combination of the two approaches.…”
Section: Related Work
mentioning (confidence: 99%)
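
As a rough picture of that recipe, the toy PyTorch sketch below assumes the high-resource model is already trained, swaps in a fresh text embedding for the target language's symbol set, and fine-tunes on the small target corpus. The model class, symbol counts, checkpoint name, learning rate, and target_loader are all hypothetical placeholders, not the authors' implementation.

```python
import torch
from torch import nn

# Toy stand-in for a spectrogram prediction network; only the transfer
# recipe matters here, not the architecture.
class TinyTTS(nn.Module):
    def __init__(self, n_symbols: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embedding(text_ids))
        return self.proj(h)  # (batch, time, n_mels)

# 1. Load weights trained on the high-resource language
#    (checkpoint name is an illustrative assumption).
model = TinyTTS(n_symbols=100)
model.load_state_dict(torch.load("high_resource_tts.pt"))

# 2. Re-initialize the embedding for the target language's symbol set;
#    the acoustic layers keep their transferred weights.
model.embedding = nn.Embedding(60, 256)

# 3. Fine-tune the whole network on the small (~30 min) target corpus;
#    target_loader is a hypothetical DataLoader of (text, mel) pairs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for text_ids, mel in target_loader:
    loss = nn.functional.mse_loss(model(text_ids), mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same loop applies whether the small target corpus is the original data, the pitch/speed-augmented data, or both, which is how the cited paper compares TL, augmentation, and their combination.
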