Training a multi-speaker Text-to-Speech (TTS) model requires multiple speakers' voices to build an average speech model. However, when a new speaker provides too little training data, this average model becomes distorted or over-smoothed, resulting in low-quality synthesis. Existing methods require fine-tuning the model; without it, adaptation quality is low, and reaching high adaptation quality typically requires at least thousands of fine-tuning steps. To address these issues, this paper proposes a Vietnamese multi-speaker adaptive TTS technique that synthesizes high-quality speech and adapts effectively to new speakers, with two main improvements: (1) an Extracting Mel-Vector (EMV) architecture with three components, Encoder, Decoder, and Embedding Features, which learns complete speaker features from Mel-spectrogram input for few-shot training; and (2) a continual-learning technique called "data-distributing" that preserves the new speaker's characteristics after many training epochs. The proposed model outperformed the baseline multi-speaker synthesis model, achieving a MOS of 3.8/4.6 and a SIM of 2.6/4 with only 1 min of the target speaker's voice.
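The abstract does not give implementation details of the EMV extractor, so the following is a minimal PyTorch sketch of how an Encoder, Decoder, and Embedding split over Mel-spectrogram input could be wired together. The layer types, the sizes (n_mels=80, emb_dim=128), and all class and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an EMV-style speaker-feature extractor (assumed design,
# not the paper's code): the encoder compresses mel frames, the embedding head
# pools them into a fixed-length speaker vector, and the decoder reconstructs
# mel frames from that vector so it is forced to retain speaker information.
import torch
import torch.nn as nn

class EMVExtractor(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        # Encoder: turns the mel-spectrogram frames into a hidden sequence.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Embedding head: pools the sequence into one speaker vector.
        self.embedding = nn.Linear(2 * hidden, emb_dim)
        # Decoder: maps the speaker vector back to mel frames (reconstruction loss).
        self.decoder = nn.Linear(emb_dim, n_mels)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels)
        hidden_seq, _ = self.encoder(mel)
        speaker_vec = self.embedding(hidden_seq.mean(dim=1))     # (batch, emb_dim)
        recon = self.decoder(speaker_vec).unsqueeze(1).expand_as(mel)
        return speaker_vec, recon

mel = torch.randn(2, 400, 80)            # roughly 4 s of 80-band mel frames
speaker_vec, recon = EMVExtractor()(mel)
print(speaker_vec.shape)                 # torch.Size([2, 128])
```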
Current adaptive speech synthesis techniques follow two main streams: (1) fine-tuning the model on a small amount of adaptation data, or (2) conditioning the entire model on a speaker embedding of the target speaker. However, both methods require the adaptation data to appear during training, which makes generating new voices quite expensive. In addition, traditional TTS models use a simple loss function to reproduce the acoustic features, and this optimization relies on incorrect distributional assumptions, leading to noisy synthesized audio. To solve these problems, we introduce the Adapt-TTS model, which synthesizes high-quality audio from a small adaptation sample without additional training. Key contributions: (1) the Extracting Mel-vector (EMV) architecture gives a better representation of speaker characteristics and speaking style; (2) an improved zero-shot model with a denoising diffusion component (Mel-spectrogram denoiser) allows new voices to be synthesized without training and with better quality (less noise). The evaluation results demonstrate the model's effectiveness: with only a single utterance (1-3 seconds) from the reference speaker, the system produces high-quality synthesis with high speaker similarity.
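To make the zero-shot pipeline concrete, here is a minimal sketch of a diffusion-style Mel-spectrogram denoiser that refines pure noise into a mel-spectrogram while conditioned on the reference speaker's embedding, with no fine-tuning step. The TinyDenoiser stand-in, the linear noise schedule, the deterministic (DDIM-style) update, and the omission of text conditioning are all simplifying assumptions; the abstract does not specify Adapt-TTS's actual denoiser or sampler.

```python
# Sketch of speaker-conditioned zero-shot mel generation with a diffusion-style
# denoiser (assumed design, not the paper's code). Text conditioning is omitted
# to keep the example short.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the Mel-spectrogram denoiser; predicts the added noise."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Linear(n_mels + emb_dim + 1, n_mels)

    def forward(self, mel, t, speaker_vec):
        b, frames, _ = mel.shape
        cond = speaker_vec.unsqueeze(1).expand(b, frames, -1)   # speaker conditioning
        t_feat = torch.full((b, frames, 1), float(t))            # timestep feature
        return self.net(torch.cat([mel, cond, t_feat], dim=-1))

@torch.no_grad()
def zero_shot_mel(denoiser, speaker_vec, n_frames=400, n_mels=80, steps=50):
    mel = torch.randn(1, n_frames, n_mels)               # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)             # illustrative schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(mel, t, speaker_vec)               # speaker-conditioned noise estimate
        x0 = (mel - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        mel = prev.sqrt() * x0 + (1 - prev).sqrt() * eps  # deterministic reverse step
    return mel                                            # passed to a vocoder afterwards

mel = zero_shot_mel(TinyDenoiser(), speaker_vec=torch.randn(1, 128))
print(mel.shape)   # torch.Size([1, 400, 80])
```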