Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling

Korte, Marcel de; Kim, Jaebok; Klabbers, Esther

doi:10.21437/interspeech.2020-2664

Cited by 14 publications (10 citation statements)

References 18 publications (32 reference statements)

“…We obtained MLME values from the following studies: [6], [12], [13], [14], [16], [17], [18], [20], [22], [25], [26], and [27], and reported them in Table 3, both as a whole and in specific groups of evaluation metrics, in the form of median (M) and interquartile range (IQR). Also reported are the p-values of the corresponding one-sample Wilcoxon signed rank tests for the hypothesis that the median MLME values are larger than 0.…”

Section: Results (mentioning)

confidence: 99%

“…These resulting values (n = 880) were used for analysis. [6], [7], [8], [9], [10], [11], [12] Hidden Markov Model synthesis (HMM) 7 [12], [13], [14], [15], [16], [17], [18] Neural network (non-S2S) synthesis (DNN) 9 [19], [20], [21], [22], [23], [24], [25], [26], [27] Sequence-to-sequence synthesis (S2S)…”

Section: Characteristics Of the Included Studies (mentioning)

confidence: 99%

“…From the chosen studies, notable studies include [18] (hereafter Study A), the latest included DNN-based study, which used 5 different evaluation metrics for TTS in Tibetan (with Mandarin as the source language). For S2S-based studies, among the latest are [27] (Study B), which explored TTS for Indic LRLs in the Indo-Aryan and Dravidian families, and [25] (Study C), which investigated strategies for using Dutch and other European languages to aid a limited amount of English data.…”

Section: Notable Studies (mentioning)

confidence: 99%

“…Hidden Reference & Anchor) Naturalness [25] DMOS (Degradation MOS) Similarity [18], [22], [27] where v_multi and v_mono are the reported values of output quality from the corresponding multilingual and monolingual models, respectively, and (*) is the scenario in which the metric m positively correlates with general output quality (the higher, the better, e.g., MOS, MUSHRA, etc.), as opposed to the opposite correlation (the lower, the better, e.g., MCD, WER, etc.…”

Section: Mushra (Multiple Stimuli With (mentioning)

confidence: 99%

A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages

Do, Coler, Dijkstra et al. 2021

Interspeech 2021

Self Cite

We provide a systematic review of past studies that use multilingual data for text-to-speech (TTS) of low-resource languages (LRLs). We focus on the strategies used by these studies for incorporating multilingual data and how they affect output speech quality. To investigate the difference in output quality between corresponding monolingual and multilingual models, we propose a novel measure to compare this difference across the included studies and their various evaluation metrics. This measure, called the Multilingual Model Effect (MLME), is found to be affected by: acoustic model architecture, the difference ratio of target language data between corresponding multilingual and monolingual experiments, the balance ratio of target language data to total data, and the amount of target language data used. These findings can act as reference for data strategies in future experiments with multilingual TTS models for LRLs. Language family classification, despite being widely used, is not found to be an effective criterion for selecting source languages.

“…On the other side of the spectrum, TTS has come a long way from sounding somewhat robotic to more natural voices. Even so, prosody can still sound off in conversations and TTS components seems underdeveloped for other languages than English, though efforts have been made for multilingual modeling for TTS (De Korte et al, 2020). Also non-verbal elements of conversational speech such as backchanneling and laughter are usually prerecorded for TTS systems.…”

Section: | Chapter (mentioning)

confidence: 99%