2021
DOI: 10.1186/s13636-021-00225-4

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We…

Cited by 13 publications (6 citation statements)
References 13 publications
“…Clean Lombard speech is speech produced under the Lombard effect but without noise in the audio. In our experiment, the clean Lombard speech is synthetic Lombard speech generated by modifying the prosody of normal speech (intensity, pitch, duration) into Lombard speech using the SoX audio manipulation toolkit [41], [42]. No noise was included in the resulting audio.…”
Section: B. Training Methods
mentioning (confidence: 99%)
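
For orientation, the sketch below shows how such a prosody modification can be scripted around SoX's standard pitch, tempo, and gain effects. The effect values and file names are illustrative assumptions, not the settings reported in [41], [42].

```python
import subprocess

def make_clean_lombard(in_wav: str, out_wav: str) -> None:
    """Approximate Lombard prosody by shifting pitch, stretching duration,
    and normalizing intensity with standard SoX effects. The values below
    are illustrative, not the cited authors' settings."""
    subprocess.run(
        [
            "sox", in_wav, out_wav,
            "pitch", "150",      # raise pitch by 150 cents
            "tempo", "0.9",      # slow the speech down by ~10%
            "gain", "-n", "-3",  # normalize intensity to -3 dBFS
        ],
        check=True,
    )

make_clean_lombard("normal.wav", "clean_lombard.wav")
```

Because no noise is mixed in at any step, the output stays "clean" in the sense the authors describe.
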
“…Deep learning, however, relies heavily on a substantial quantity of training data [34], [35], to the extent that [33], [36] stated that DNNs are not a suitable technique for TTS in low-resource languages. In [37], however, techniques such as monolingual transfer learning, cross-lingual transfer learning, multi-speaker models, multilingual models, and data augmentation have been proposed as means of enabling TTS for low-resource languages.…”
Section: Text-to-Speech Translation
mentioning (confidence: 99%)
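
Of the listed techniques, data augmentation is the most direct to illustrate. The snippet below is a minimal sketch of pitch- and speed-based augmentation using librosa; the file name and parameter grids are assumptions for illustration, not values taken from [37].

```python
import librosa
import soundfile as sf

# Load one utterance from the (hypothetical) low-resource corpus.
y, sr = librosa.load("utterance.wav", sr=22050)

# Pitch variants: shift by a few semitones up and down.
for n_steps in (-2, -1, 1, 2):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(f"utterance_pitch{n_steps:+d}.wav", shifted, sr)

# Speed variants: stretch or compress duration without changing pitch.
for rate in (0.9, 1.1):
    stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(f"utterance_speed{rate}.wav", stretched, sr)
```

Each source utterance thus yields several prosodically distinct copies, multiplying the effective size of the text-speech paired corpus without new recordings.
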
“…A cross-lingual transfer learning and data augmentation approach for low-resource TTS was proposed in (Byambadorj et al., 2021). The spectrogram prediction network was trained using cross-lingual transfer learning (TL) from a high-resource language, data augmentation by varying parameters such as pitch and speed, and a combination of the two approaches.…”
Section: Related Work
mentioning (confidence: 99%)
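
As a rough picture of that recipe, the toy PyTorch sketch below assumes the high-resource model is already trained, swaps in a fresh text embedding for the target language's symbol set, and fine-tunes on the small target corpus. The model class, symbol counts, checkpoint name, learning rate, and target_loader are all hypothetical placeholders, not the authors' implementation.

```python
import torch
from torch import nn

# Toy stand-in for a spectrogram prediction network; only the transfer
# recipe matters here, not the architecture.
class TinyTTS(nn.Module):
    def __init__(self, n_symbols: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embedding(text_ids))
        return self.proj(h)  # (batch, time, n_mels)

# 1. Load weights trained on the high-resource language
#    (checkpoint name is an illustrative assumption).
model = TinyTTS(n_symbols=100)
model.load_state_dict(torch.load("high_resource_tts.pt"))

# 2. Re-initialize the embedding for the target language's symbol set;
#    the acoustic layers keep their transferred weights.
model.embedding = nn.Embedding(60, 256)

# 3. Fine-tune the whole network on the small (~30 min) target corpus;
#    target_loader is a hypothetical DataLoader of (text, mel) pairs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for text_ids, mel in target_loader:
    loss = nn.functional.mse_loss(model(text_ids), mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same loop applies whether the small target corpus is the original data, the pitch/speed-augmented data, or both, which is how the cited paper compares TL, augmentation, and their combination.
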