Building Multilingual End-to-End Speech Synthesisers for Indian Languages

Prakash, Anusha; Thomas, Anju Leela; Umesh, S.; Murthy, Hema A.

doi:10.21437/ssw.2019-35

Cited by 17 publications

(19 citation statements)

References 16 publications

“…We obtained MLME values from the following studies: [6], [12], [13], [14], [16], [17], [18], [20], [22], [25], [26], and [27], and reported them in Table 3, both as a whole and in specific groups of evaluation metrics, in the form of median (M) and interquartile range (IQR). Also reported are the p-values of the corresponding one-sample Wilcoxon signed rank tests for the hypothesis that the median MLME values are larger than 0.…”

Section: Results (mentioning)

confidence: 99%

“…These resulting values (n = 880) were used for analysis. [6], [7], [8], [9], [10], [11], [12] Hidden Markov Model synthesis (HMM) 7 [12], [13], [14], [15], [16], [17], [18] Neural network (non-S2S) synthesis (DNN) 9 [19], [20], [21], [22], [23], [24], [25], [26], [27] Sequence-to-sequence synthesis (S2S)…”

Section: Characteristics Of the Included Studies (mentioning)

confidence: 99%

“…Hidden Reference & Anchor) Naturalness [25] DMOS (Degradation MOS) Similarity [18], [22], [27] where v_multi and v_mono are the reported values of output quality from the corresponding multilingual and monolingual models, respectively, and (*) is the scenario in which the metric m positively correlates with general output quality (the higher, the better, e.g., MOS, MUSHRA, etc.), as opposed to the opposite correlation (the lower, the better, e.g., MCD, WER, etc.…”

Section: Mushra (Multiple Stimuli With… (mentioning)

confidence: 99%

“…Since all studies (except for [19] and [26]) mentioned this only in either time duration or number of utterances, an average estimation was needed to convert it to a common measurement. Following the descriptions provided in [10], [11], [12], [16], [17], [20], [21], [22], [23], [24], and [27], we obtained an average utterance length of the speech data sets used in these studies: 6.1 seconds. This was then used to convert all the training data quantities to the corresponding estimated number of utterances.…”

Section: Influential Factors In Data Augmentation Strategy For Multilingual Models (mentioning)

confidence: 99%
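
The conversion described in the statement above (turning a reported training-data duration into an estimated number of utterances via the 6.1-second average) can be sketched as follows. This is a minimal illustration, not code from the cited review: the function name, the use of hours as the input unit, and the choice to round to the nearest integer are all assumptions made here.

```python
# Average utterance length (seconds) quoted in the citation statement.
AVG_UTTERANCE_SECONDS = 6.1


def estimated_utterances(duration_hours: float,
                         avg_seconds: float = AVG_UTTERANCE_SECONDS) -> int:
    """Estimate how many utterances a corpus of the given duration contains.

    Hypothetical helper: hours as the input unit and rounding to the
    nearest integer are assumptions, not details taken from the review.
    """
    if duration_hours < 0:
        raise ValueError("duration_hours must be non-negative")
    if avg_seconds <= 0:
        raise ValueError("avg_seconds must be positive")
    return round(duration_hours * 3600 / avg_seconds)


if __name__ == "__main__":
    for hours in (0.5, 1.0, 10.0):
        print(f"{hours} h is roughly {estimated_utterances(hours)} utterances")
```

Studies that already report utterance counts would skip this step; the estimate only puts duration-based reports on the same scale.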

A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages

Do, Coler, Dijkstra et al. 2021

Interspeech 2021

We provide a systematic review of past studies that use multilingual data for text-to-speech (TTS) of low-resource languages (LRLs). We focus on the strategies used by these studies for incorporating multilingual data and how they affect output speech quality. To investigate the difference in output quality between corresponding monolingual and multilingual models, we propose a novel measure to compare this difference across the included studies and their various evaluation metrics. This measure, called the Multilingual Model Effect (MLME), is found to be affected by: acoustic model architecture, the difference ratio of target language data between corresponding multilingual and monolingual experiments, the balance ratio of target language data to total data, and the amount of target language data used. These findings can act as reference for data strategies in future experiments with multilingual TTS models for LRLs. Language family classification, despite being widely used, is not found to be an effective criterion for selecting source languages.

“…The parser leveraged the phonetic similarity among the Indian languages to generate the lexicon. Further [4] explored ways of merging tokens, based either on characters or on phones. Based on subjective evaluations, both approaches gave reasonable quality speech synthesis.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring the use of Common Label Set to Improve Speech Recognition of Low Resource Indian Languages

Shetty, Umesh 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In many Indian languages, written characters are organized on sound phonetic principles, and the ordering of characters is the same across many of them. However, while training conventional end-to-end (E2E) Multilingual speech recognition systems, we treat characters or target subword units from different languages as separate entities. Since the visual rendering of these characters is different, in this paper, we explore the benefits of representing such similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS). The CLS can be very easily created using automatic methods since the ordering of characters is the same in many Indian Languages. E2E models are trained using a transformer-based encoder-decoder architecture. During testing, given the Mel filterbank features as input, the system outputs a sequence of BPE units in CLS representation. Depending on the language, we then map the recognized CLS units back to the language-specific grapheme representation. Results show that models trained using CLS improve over monolingual baseline and a multilingual framework with separate symbols for each language. Similar experiments on a subset of the Voxforge dataset also confirm the benefits of CLS. An extension of this idea is to decode an unseen language (Zero-resource) using CLS trained model.
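
The abstract above says a common label set can be built automatically because character ordering is shared across many Indian scripts. One way to picture this is the sketch below, which assumes (this is not stated in the abstract) that a character's offset inside its Unicode script block can serve as the shared label; the major Indic blocks in Unicode are laid out in parallel, so the same letter sits at the same offset in each. The function names and language codes are hypothetical, and the sketch works per character, whereas the paper operates on BPE units.

```python
# Start of the Unicode block for each script (each block spans 0x80 code points).
# The language codes used as keys are illustrative choices.
BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80


def to_cls(text: str, lang: str) -> list:
    """Map each in-block character to its block offset (the assumed common label).

    Characters outside the script block (spaces, digits, punctuation)
    are passed through unchanged.
    """
    start = BLOCK_START[lang]
    labels = []
    for ch in text:
        offset = ord(ch) - start
        labels.append(offset if 0 <= offset < BLOCK_SIZE else ch)
    return labels


def from_cls(labels: list, lang: str) -> str:
    """Map common labels back to the graphemes of the requested script."""
    start = BLOCK_START[lang]
    return "".join(chr(start + x) if isinstance(x, int) else x for x in labels)


if __name__ == "__main__":
    hindi_ka = "\u0915"  # DEVANAGARI LETTER KA
    tamil_ka = "\u0b95"  # TAMIL LETTER KA
    print(to_cls(hindi_ka, "hi") == to_cls(tamil_ka, "ta"))  # same label for both
    print(from_cls(to_cls(hindi_ka, "hi"), "ta") == tamil_ka)
```

Not every offset is assigned in every script (Tamil, for instance, lacks several consonants that Devanagari has), so mapping labels into a different script than the source can yield unassigned code points; a real system would need to handle those cases.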

Computational Linguistics‐Based Tamil Character Recognition System for Text to Speech Conversion

Suriya, Balaji, Gowtham et al. 2021

Machine Vision Inspection Systems, Volume 2
