6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018) 2018
DOI: 10.21437/sltu.2018-14
A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

Cited by 26 publications (24 citation statements); References: 4 publications.
“…Our work utilizes publicly available datasets: LJSpeech [50], an English speech corpus with a total duration of about 24 hours; TITML-IDN [51], an Indonesian (ID) speech corpus with an average of 43 minutes for each speaker; OpenSLR jv-ID [52], a Javanese (JV) speech corpus with an average of 10 minutes for each speaker; OpenSLR su-ID [52], a Sundanese (SU) speech corpus with an average of 7 minutes for each speaker. T2 model for Indonesian, Javanese, and Sundanese uses a subset of corpus consisting of one female speaker for each language as shown in Table 1.…”
Section: A. Dataset (citation type: mentioning; confidence: 99%)
“…In order to gather data for new languages, we use a questionnaire asking language consultants to describe all the ways written tokens in various domains can be verbalized (see also [10,11]). We then need to convert this information to a machine-readable format so that it can be used in verbalizer grammars.…”
Section: Verbalization Templates (citation type: mentioning; confidence: 99%)
“…We then need to convert this information to a machine-readable format so that it can be used in verbalizer grammars. Initially [10,11], this was performed by populating a Thrax [14] grammar template. However, we have moved this system to Pynini [15], a Python library which inherits the functionality of Thrax and can use Python's extensive libraries and testing frameworks.…”
Section: Verbalization Templates (citation type: mentioning; confidence: 99%)
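The workflow quoted above (collect verbalizations of written tokens via questionnaire, then convert them to a machine-readable form for verbalizer grammars) can be illustrated with a minimal, dependency-free sketch. This is plain Python standing in for the Pynini/Thrax finite-state grammars the citing work describes; the token table and example strings are invented for illustration, not taken from the paper:

```python
import re

# Toy "verbalization template": written tokens mapped to spoken forms,
# standing in for the machine-readable questionnaire output described
# above. A real system would compile such a table into an FST grammar.
TEMPLATE = {
    "2": "two",
    "km": "kilometers",
    "%": "percent",
}

def verbalize(text: str) -> str:
    """Replace each known written token with its spoken form,
    leaving unknown tokens unchanged."""
    tokens = re.findall(r"\w+|\S", text)
    return " ".join(TEMPLATE.get(tok, tok) for tok in tokens)
```

A lookup table like this captures the questionnaire answers; moving to Pynini, as the excerpt notes, lets the same mappings be composed with context-dependent rules while reusing Python's testing frameworks.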