SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Casanova, Edresson; Shulby, Christopher; Golge, Eren; Müller, Nicolas M.; Oliveira, Frederico Santos de; Cândido, Arnaldo; Soares, Anderson da Silva; Aluísio, Sandra Maria; Ponti, Moacir Antonelli

doi:10.21437/interspeech.2021-1774

Cited by 62 publications

(57 citation statements)

References 10 publications


“…Zero-shot multi-speaker TTS (ZS-TTS) employs only a few seconds of speech dataset to create synthesized voices for target speakers. SC-GlowTTS is a novel model (GlowTTS and HiFi-GAN) to enhance the likeness to the target speaker using ZS-TTS [8]. The all pass warp (APW) has been integrated into neural network frameworks and investigated in ZS-TTS [35].…”

Section: Related Work (mentioning)

confidence: 99%

Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis

Mandeel, Al-Radhi, Csapó. 2022. Multimed Tools Appl


This paper presents an investigation of speaker adaptation using a continuous vocoder for parametric text-to-speech (TTS) synthesis. In purposes that demand low computational complexity, conventional vocoder-based statistical parametric speech synthesis can be preferable. While capable of remarkable naturalness, recent neural vocoders nonetheless fall short of the criteria for real-time synthesis. We investigate our former continuous vocoder, in which the excitation is characterized employing two one-dimensional parameters: Maximum Voiced Frequency and continuous fundamental frequency (F0). We show that an average voice can be trained for deep neural network-based TTS utilizing data from nine English speakers. We did speaker adaptation experiments for each target speaker with 400 utterances (approximately 14 minutes). We showed an apparent enhancement in the quality and naturalness of synthesized speech compared to our previous work by utilizing the recurrent neural network topologies. According to the objective studies (Mel-Cepstral Distortion and F0 correlation), the quality of speaker adaptation using Continuous Vocoder-based DNN-TTS is slightly better than the WORLD Vocoder-based baseline. The subjective MUSHRA-like test results also showed that our speaker adaptation technique is almost as natural as the WORLD vocoder using Gated Recurrent Unit and Long Short Term Memory networks. The proposed vocoder, being capable of real-time synthesis, can be used for applications which need fast synthesis speed.


“…The term "Audio DeepFake" covers solutions that can create artificially modified speech. These solutions can either generate new utterances using Text-To-Speech (TTS) [1]-[3] and Voice Cloning [4], [5] methods, or modify existing utterances and therefore change it to someone else - Voice Conversion [6], [7]. The more recent architectures not only provide a synthesis of a speech but also focus on proper intonation, stress, and rhythm.…”

Section: Introduction (mentioning)

confidence: 99%

SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection

Kawa, Plata, Syga. 2022. Preprint


Audio DeepFakes are utterances generated with the use of deep neural networks. They are highly misleading and pose a threat due to use in fake news, impersonation, or extortion. In this work, we focus on increasing accessibility to the audio DeepFake detection methods by providing SpecRNet, a neural network architecture characterized by a quick inference time and low computational requirements. Our benchmark shows that SpecRNet, requiring up to about 40% less time to process an audio sample, provides performance comparable to LCNN architecture, one of the best audio DeepFake detection models. Such a method can not only be used by online multimedia services to verify a large bulk of content uploaded daily but also, thanks to its low requirements, by average citizens to evaluate materials on their devices. In addition, we provide benchmarks in three unique settings that confirm the correctness of our model. They reflect scenarios of low-resource datasets, detection on short utterances and limited attacks benchmark in which we take a closer look at the influence of particular attacks on given architectures.


“…They are composed of two complementary systems: automatic speech recognition (ASR), and text-to-speech (TTS). In case of a TTS model, it receives text (or phonemes) as input and produces synthesized speech as output with desired features like emotion, intonation, rhythm, etc [3].…”

Section: Introduction (mentioning)

confidence: 99%

Performance Comparison of TTS Models for Brazilian Portuguese to Establish a Baseline

Lobato, Farias, Castañeda et al. 2022. Preprint


This paper compares the performance of three text-to-speech (TTS) models released from June 2021 to January 2022 in order to establish a baseline for Brazilian Portuguese. Those models were trained using dataset for Brazilian Portuguese. The experimental setup considers tts-portuguese dataset to fine-tune the following TTS models: VITS end-to-end model; glowtts and gradtts acoustic models both using hifi-gan vocoder. Performance metrics are arranged into objective and subjective metrics. As subjective metrics, the naturalness and intelligibility are measured based on the mean opinion score (MOS). Results shows that gradtts+hifigan model achieved naturalness of 4.07 MOS, close to performance of current commercial models.
