Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2

Mandeel, Ali Raheem; Al-Radhi, Mohammed Salah; Csapó, Tamás Gábor

doi:10.36244/icj.2022.3.7

Cited by 4 publications

(2 citation statements)

References 15 publications

(24 reference statements)

“…Csapó et al have extensively explored the role of prosodic variability methods in a corpus-based unit selection text-to-speech system [31], and have worked on enhancing the naturalness of synthesized speech [32]. More recently, Mandeel et al [33] demonstrate successful speaker adaptation experiments using Tacotron2, a state-of-the-art text-to-speech synthesis system.…”

Section: Speaker Adaptation In Text-to-speech Synthesis

mentioning

confidence: 99%

Speech synthesis from intracranial stereotactic Electroencephalography using a neural vocoder

Arthur, Csapó

2024

Infocommunications journal

Speech is one of the most important human biosignals. However, only some speech production characteristics are fully understood, which are required for a successful speech-based Brain-Computer Interface (BCI). A proper brain-to-speech system that can generate the speech of full sentences intelligibly and naturally poses a great challenge. In our study, we used the SingleWordProduction-Dutch-iBIDS dataset, in which speech and intracranial stereotactic electroencephalography (sEEG) signals of the brain were recorded simultaneously during a single word production task. We apply deep neural networks (FC-DNN, 2D-CNN, and 3D-CNN) on the ten speakers’ data for sEEG-to-Mel spectrogram prediction. Next, we synthesize speech using the WaveGlow neural vocoder. Our objective and subjective evaluations have shown that the DNN-based approaches with a neural vocoder outperform the baseline linear regression model using Griffin-Lim. The synthesized samples resemble the original speech but are still not intelligible, and the results are clearly speaker dependent. In the long term, speech-based BCI applications might be useful for the speaking impaired or those having neurological disorders.

“…This study investigated and adapted many postfilter architectures with minimal data. Using the TTS model (Tacotron2), it was found that five minutes of the target speaker's adaptation data with a low training time of checkpoint 900 (an iteration point in the training process) is enough to have a reasonable synthesized speech quality [55]. Moreover, a meta-learning algorithm was applied to the speaker adaptation method to increase the target speaker similarity and decrease the adaptation data [56].…”

Section: Speaker Adaptation

mentioning

confidence: 99%

A Smart Control System for the Oil Industry Using Text-to-Speech Synthesis Based on IIoT

et al. 2023

Self Cite

Oil refineries have high operating expenses and are often exposed to increased asset integrity risks and functional failure. Real-time monitoring of their operations has always been critical to ensuring safety and efficiency. We proposed a novel Industrial Internet of Things (IIoT) design that employs a text-to-speech synthesizer (TTS) based on neural networks to build an intelligent extension control system. We enhanced a TTS model to achieve high inference speed by employing the HiFi-GAN V3 vocoder in the acoustic model FastSpeech 2. We experimented with our system on a low-resource embedded system in a real-time environment. Moreover, we customized the TTS model to generate two target speakers (female and male) using a small dataset. We performed an ablation analysis by conducting experiments to evaluate the performance of our design (IoT connectivity, memory usage, inference speed, and output speech quality). The results demonstrated that our system's Real-Time Factor (RTF) is 6.4 (without deploying the cache mechanism, which is a technique to call the previously synthesized speech sentences in our system memory). Using the cache mechanism, our proposed model successfully runs on a low-resource computational device with real-time speed (RTF equals 0.16, 0.19, and 0.29 when the memory has 250, 500, and 1000 WAV files, respectively). Additionally, applying the cache mechanism has reduced the memory usage percentage from 16.3% (for synthesizing a sentence of ten seconds) to 6.3%. Furthermore, according to the objective speech quality evaluation, our TTS model is superior to the baseline TTS model.

Enhancing End-to-End Speech Synthesis by Modeling Interrogative Sentences with Speaker Adaptation

Mandeel, Salah Al-Radhi, Csapó

2023

2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
