Jinlong Xue scite author profile

In recent years, neural-network-based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress. However, the speaker encoder models currently used in these methods still fail to capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that generates high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better naturalness and similarity. To evaluate our synthesized speech efficiently, we are the first to adopt deep-learning-based automatic MOS evaluation methods to assess our results, and these methods show great potential for automatic speech quality assessment.
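The three-stage design described in the abstract (speaker encoder, synthesizer, vocoder) can be pictured as a simple chain of functions. The sketch below is only an illustration of that data flow, not the authors' implementation: the class name `MultiSpeakerTTS` and the three `toy_*` functions are hypothetical stand-ins that mimic the shape of each stage's input and output, and they do not load or approximate ECAPA-TDNN, FastSpeech2, or HiFi-GAN.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
Mel = List[List[float]]  # frames x mel bins


@dataclass
class MultiSpeakerTTS:
    """Chains three independently trained stages (hypothetical wrapper)."""
    speaker_encoder: Callable[[Sequence[float]], Vector]  # reference audio -> speaker embedding
    synthesizer: Callable[[str, Vector], Mel]             # text + embedding -> mel-spectrogram
    vocoder: Callable[[Mel], List[float]]                 # mel-spectrogram -> waveform

    def synthesize(self, text: str, reference_audio: Sequence[float]) -> List[float]:
        embedding = self.speaker_encoder(reference_audio)
        mel = self.synthesizer(text, embedding)
        return self.vocoder(mel)


# Toy stand-ins (assumptions for illustration only): they mimic shapes, not the real models.
def toy_speaker_encoder(audio: Sequence[float], dim: int = 4) -> Vector:
    mean = sum(audio) / len(audio)
    return [mean] * dim


def toy_synthesizer(text: str, embedding: Vector, n_mels: int = 3) -> Mel:
    # One frame per character, each frame shifted by the first embedding value.
    return [[embedding[0] + ord(ch) / 1000.0] * n_mels for ch in text]


def toy_vocoder(mel: Mel, hop: int = 2) -> List[float]:
    # Emit `hop` samples per mel frame.
    return [frame[0] for frame in mel for _ in range(hop)]


tts = MultiSpeakerTTS(toy_speaker_encoder, toy_synthesizer, toy_vocoder)
waveform = tts.synthesize("hi", reference_audio=[0.1, 0.3])
```

Because each stage is only a callable with a fixed input and output, any one of them could in principle be swapped for a different model without touching the others, which is what makes comparing alternative speaker encoders straightforward in a design like this.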

Silicon substrate diamond film detector for gamma dose rate measurement in a high radiation environment

Xue, Hou, Niu, et al. 2022. Diamond and Related Materials.

M²-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis

Xue, Deng, Wang, et al. 2023.

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Jinlong Xue

CQDS preluded carbon-incorporated 3D burger-like hybrid ZnO enhanced visible-light-driven photocatalytic activity and mechanism implication

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Silicon substrate diamond film detector for gamma dose rate measurement in a high radiation environment

M²-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis
