Training a multi-speaker Text-to-Speech (TTS) model requires multiple speakers' voices to build an average speech model. However, when a new speaker provides too little training data, this average model becomes distorted or over-smoothed, resulting in low-quality synthesis. Existing methods require fine-tuning the model; without it, adaptation quality is low, and reaching high adaptation quality typically requires at least thousands of fine-tuning steps. To address these issues, this paper proposes a Vietnamese multi-speaker adaptive TTS technique that synthesizes high-quality speech and adapts effectively to new speakers, with two main improvements: (1) an Extracting Mel-Vector (EMV) architecture with three components, Encoder, Decoder, and Embedding Features, which learns complete speaker features from Mel-spectrogram input for few-shot training; and (2) a continual-learning technique called "data-distributing" that preserves the new speaker's characteristics after many training epochs. The proposed model outperformed the baseline multi-speaker synthesis model, achieving a MOS of 3.8/4.6 and a SIM of 2.6/4 with only 1 min of the target speaker's voice.
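The abstract does not give implementation details of the EMV extractor, so the following is a minimal PyTorch sketch of how an Encoder, Decoder, and Embedding split over Mel-spectrogram input could be wired together. The layer types, the sizes (n_mels=80, emb_dim=128), and all class and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an EMV-style speaker-feature extractor (assumed design,
# not the paper's code): the encoder compresses mel frames, the embedding head
# pools them into a fixed-length speaker vector, and the decoder reconstructs
# mel frames from that vector so it is forced to retain speaker information.
import torch
import torch.nn as nn

class EMVExtractor(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        # Encoder: turns the mel-spectrogram frames into a hidden sequence.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Embedding head: pools the sequence into one speaker vector.
        self.embedding = nn.Linear(2 * hidden, emb_dim)
        # Decoder: maps the speaker vector back to mel frames (reconstruction loss).
        self.decoder = nn.Linear(emb_dim, n_mels)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels)
        hidden_seq, _ = self.encoder(mel)
        speaker_vec = self.embedding(hidden_seq.mean(dim=1))     # (batch, emb_dim)
        recon = self.decoder(speaker_vec).unsqueeze(1).expand_as(mel)
        return speaker_vec, recon

mel = torch.randn(2, 400, 80)            # roughly 4 s of 80-band mel frames
speaker_vec, recon = EMVExtractor()(mel)
print(speaker_vec.shape)                 # torch.Size([2, 128])
```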
Current adaptive speech synthesis techniques follow two main streams: (1) fine-tuning the model on a small amount of adaptation data, or (2) conditioning the entire model on a speaker embedding of the target speaker. However, both methods require the adaptation data to appear during training, which makes generating new voices quite expensive. In addition, traditional TTS models use a simple loss function to reproduce the acoustic features, and this optimization relies on incorrect distributional assumptions, leading to noisy synthesized audio. To solve these problems, we introduce the Adapt-TTS model, which synthesizes high-quality audio from a small adaptation sample without additional training. Key contributions: (1) the Extracting Mel-vector (EMV) architecture gives a better representation of speaker characteristics and speaking style; (2) an improved zero-shot model with a denoising diffusion component (Mel-spectrogram denoiser) allows new voices to be synthesized without training and with better quality (less noise). The evaluation results demonstrate the model's effectiveness: with only a single utterance (1-3 seconds) from the reference speaker, the system produces high-quality synthesis with high speaker similarity.
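To make the zero-shot pipeline concrete, here is a minimal sketch of a diffusion-style Mel-spectrogram denoiser that refines pure noise into a mel-spectrogram while conditioned on the reference speaker's embedding, with no fine-tuning step. The TinyDenoiser stand-in, the linear noise schedule, the deterministic (DDIM-style) update, and the omission of text conditioning are all simplifying assumptions; the abstract does not specify Adapt-TTS's actual denoiser or sampler.

```python
# Sketch of speaker-conditioned zero-shot mel generation with a diffusion-style
# denoiser (assumed design, not the paper's code). Text conditioning is omitted
# to keep the example short.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the Mel-spectrogram denoiser; predicts the added noise."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Linear(n_mels + emb_dim + 1, n_mels)

    def forward(self, mel, t, speaker_vec):
        b, frames, _ = mel.shape
        cond = speaker_vec.unsqueeze(1).expand(b, frames, -1)   # speaker conditioning
        t_feat = torch.full((b, frames, 1), float(t))            # timestep feature
        return self.net(torch.cat([mel, cond, t_feat], dim=-1))

@torch.no_grad()
def zero_shot_mel(denoiser, speaker_vec, n_frames=400, n_mels=80, steps=50):
    mel = torch.randn(1, n_frames, n_mels)               # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)             # illustrative schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(mel, t, speaker_vec)               # speaker-conditioned noise estimate
        x0 = (mel - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        mel = prev.sqrt() * x0 + (1 - prev).sqrt() * eps  # deterministic reverse step
    return mel                                            # passed to a vocoder afterwards

mel = zero_shot_mel(TinyDenoiser(), speaker_vec=torch.randn(1, 128))
print(mel.shape)   # torch.Size([1, 400, 80])
```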