Interspeech 2021
DOI: 10.21437/interspeech.2021-1565

A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages

Abstract: We provide a systematic review of past studies that use multilingual data for text-to-speech (TTS) of low-resource languages (LRLs). We focus on the strategies used by these studies for incorporating multilingual data and how they affect output speech quality. To investigate the difference in output quality between corresponding monolingual and multilingual models, we propose a novel measure to compare this difference across the included studies and their various evaluation metrics. This measure, called the Mu…

Citations: cited by 8 publications (4 citation statements)
References: 43 publications (104 reference statements)
“…For example, as mentioned in § IV-A, we choose Spanish and French if the target language is Italian. It should be noted that while recent work [42] has proposed a metric for TTS transfer learning, there is no universal metric to measure language similarity. In this study, we select the similar languages for the cross-lingual transfer based on language family defined in Glottolog [43].…”
Section: B Text-based Adaptation (mentioning)
confidence: 99%
“…Although the TTS systems in [5] are trained based on language families, the relevance of training them in this manner is not explored. A recent study [60] observes that language family classification may not be an effective basis for choosing (source) languages for training a generic TTS. However, our own experiences with multilingual TTS systems and observations in [18] find that the intelligibility of synthesised speech depends on the similarity between any target language and source language(s).…”
Section: Related Work (mentioning)
confidence: 99%
“…Objective metrics listed in Table 2 are divided into two aspects: a) robustness: it quantifies how correctly synthesized speech is transcribed by listeners, such as character error rate (CER), word error rate (WER) and word information lost (WIL) [18,19]; and b) latency: it measures the latency to synthesize speech, such as the real time factor (RTF) in [12].…”
Section: Objective Metrics (mentioning)
confidence: 99%
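
The robustness and latency metrics named in the last statement can be illustrated with a minimal sketch. It assumes the common definitions: WER and CER as Levenshtein edit distance divided by reference length (over words and characters respectively), and RTF as synthesis time divided by the duration of the generated audio. The function names are hypothetical and are not taken from the cited papers; WIL is omitted because it additionally requires a word-level alignment.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """CER: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: time spent synthesizing divided by duration of the audio produced
    (assumed definition; values below 1 mean faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    transcript = "the cat sit on mat"  # hypothetical listener or ASR transcript
    print(word_error_rate(reference, transcript))
    print(char_error_rate(reference, transcript))
    print(real_time_factor(0.5, 2.0))
```

In practice the transcript would come from listeners or an ASR system applied to the synthesized speech, and text normalization (casing, punctuation) would typically be applied before scoring.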