ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414870

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis

Abstract: Sequence-to-sequence (seq2seq) learning has greatly improved text-to-speech (TTS) synthesis performance, but effective implementation on resource-restricted devices remains challenging, as seq2seq models are usually computationally expensive and memory intensive. To achieve fast inference speed and small model size while maintaining high-quality speech, we propose FCL-taco2, a Fast, Controllable and Lightweight (FCL) TTS model based on Tacotron2. FCL-taco2 adopts a novel semi-autoregressive (SAR) mode for phoneme …

Cited by 9 publications (4 citation statements)
References 19 publications (26 reference statements)
“…Extending to other prosodic features, FastPitch [21] incorporates pitch control by also predicting F0 contours and FastPitchFormant [22] utilizes the predicted F0 in an excitation generator inspired by the source-filter theory in order to provide more robust and accurate pitch control. Since TTS decoders are conditioned on phoneme encoder representations, in FastSpeech 2 [23] and FCL-Taco2 [24] prosody prediction modules are introduced, which add prosodic information to these representations and are trained in a supervised manner utilizing ground truth values. In these cases, prosody information can be represented in various ways.…”
Section: Related Work
Mentioning confidence: 99%
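To make the supervised prosody-prediction idea in the quote above concrete, here is a minimal sketch in the spirit of the FastSpeech 2 / FCL-Taco2 variance adaptors: a small convolutional module maps phoneme-level encoder representations to one scalar prosodic value per phoneme (e.g. pitch or energy), is trained against ground-truth values, and adds an embedding of that value back onto the encoder outputs that condition the decoder. All names, shapes, and hyperparameters below are assumptions for illustration, not the papers' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProsodyPredictor(nn.Module):
    """Hypothetical per-phoneme prosody predictor (pitch, energy, or duration)."""

    def __init__(self, d_model: int = 256, d_hidden: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(d_hidden, 1)      # one scalar per phoneme
        self.embed = nn.Linear(1, d_model)      # re-embed the value into the encoder space

    def forward(self, enc_out, target=None):
        # enc_out: (batch, num_phonemes, d_model); target: (batch, num_phonemes) or None
        h = torch.relu(self.conv1(enc_out.transpose(1, 2)))
        h = torch.relu(self.conv2(h)).transpose(1, 2)
        pred = self.proj(h).squeeze(-1)
        # Supervised training against ground-truth prosody; predictions are used at inference.
        loss = None if target is None else F.mse_loss(pred, target)
        value = pred if target is None else target
        conditioned = enc_out + self.embed(value.unsqueeze(-1))
        return conditioned, pred, loss


enc_out = torch.randn(2, 17, 256)     # dummy phoneme encoder outputs
gt_pitch = torch.randn(2, 17)         # dummy ground-truth per-phoneme pitch
conditioned, pred, loss = ProsodyPredictor()(enc_out, gt_pitch)
```

Because the predicted value is fed back into the encoder representations, replacing or scaling it at inference time gives the kind of explicit prosody control the cited works describe.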
“…It speeds up the Parallel WaveGAN vocoder by 4x without any degradation in sound quality. Wang et al. [367] propose a semi-autoregressive mode for mel-spectrogram generation, where the mel-spectrograms are generated in an autoregressive mode within each individual phoneme and in a non-autoregressive mode across different phonemes.…”
Section: Adaptive
Mentioning confidence: 99%
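As a rough illustration of that semi-autoregressive scheme (a sketch under assumed names and shapes, not the FCL-Taco2 implementation): frames are generated autoregressively inside each phoneme's segment, while the segments for different phonemes depend only on their own phoneme state and could therefore be decoded in parallel.

```python
import torch
import torch.nn as nn


class SemiAutoregressiveDecoder(nn.Module):
    """Sketch: autoregressive within a phoneme, non-autoregressive across phonemes."""

    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRUCell(d_model + n_mels, d_model)
        self.proj = nn.Linear(d_model, n_mels)
        self.n_mels = n_mels

    def decode_phoneme(self, phoneme_vec, n_frames):
        # Autoregressive loop over the frames belonging to ONE phoneme.
        h = torch.zeros(phoneme_vec.size(0), phoneme_vec.size(-1))
        prev = torch.zeros(phoneme_vec.size(0), self.n_mels)    # "go" frame
        frames = []
        for _ in range(n_frames):
            h = self.rnn(torch.cat([phoneme_vec, prev], dim=-1), h)
            prev = self.proj(h)
            frames.append(prev)
        return torch.stack(frames, dim=1)                        # (1, n_frames, n_mels)

    def forward(self, phoneme_states, durations):
        # phoneme_states: (num_phonemes, d_model); durations: frames per phoneme.
        # Each segment depends only on its own phoneme, so this loop is parallelizable.
        segments = [
            self.decode_phoneme(phoneme_states[i : i + 1], int(durations[i]))
            for i in range(phoneme_states.size(0))
        ]
        return torch.cat(segments, dim=1)                        # (1, total_frames, n_mels)


mel = SemiAutoregressiveDecoder()(torch.randn(5, 256), [3, 7, 2, 5, 4])
print(mel.shape)   # torch.Size([1, 21, 80])
```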
“…Speedup with Domain Knowledge: Domain knowledge from speech can be leveraged to speed up inference, such as linear prediction [357], multiband modeling [411,400], subscale prediction [147], multi-frame prediction [420,376,367,123], streaming synthesis [74], etc. LPCNet [357] combines digital signal processing with neural networks, by using linear prediction coefficients to calculate the next waveform sample and a lightweight model to predict the residual value, which speeds up the inference of autoregressive waveform generation.…”
Section: Adaptive
Mentioning confidence: 99%
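A tiny numerical sketch of the linear-prediction idea behind that passage (not the LPCNet implementation; the function names and toy signal are made up): per-frame LPC coefficients estimated from the autocorrelation normal equations predict the next sample from its recent past, so a neural model only has to output the comparatively small residual (excitation).

```python
import numpy as np


def lpc_coefficients(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """Estimate LPC coefficients from the autocorrelation normal equations."""
    autocorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[autocorr[abs(i - j)] for j in range(order)] for i in range(order)])
    r = autocorr[1:order + 1]
    return np.linalg.solve(R + 1e-6 * np.eye(order), r)


def lpc_predict(history: np.ndarray, coeffs: np.ndarray) -> float:
    """Predict the next sample as a weighted sum of the previous `order` samples."""
    order = len(coeffs)
    return float(np.dot(coeffs, history[-1:-order - 1:-1]))


# Toy demo: the residual left for a neural model to predict is much smaller
# than the raw sample value, which is what keeps the network lightweight.
rng = np.random.default_rng(0)
t = np.arange(401) / 16000.0
signal = np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(len(t))
coeffs = lpc_coefficients(signal[:400])
prediction = lpc_predict(signal[:400], coeffs)
residual = signal[400] - prediction
print(f"sample={signal[400]:+.4f}  lpc_prediction={prediction:+.4f}  residual={residual:+.4f}")
```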
“…The trick non-autoregressive models use to generate Mel frames in parallel is to predict the relevant features as an intermediate step and condition the independent decoding of Mels on them. This technique is now increasingly adopted for autoregressive models as well (Wang et al., 2021) to predict features like phoneme duration that improve decoding stability, avoiding alignment issues. Our method is compatible with any architecture that predicts prosodic features of pitch, energy, and duration as an intermediate step before decoding.…”
Section: Related Work
Mentioning confidence: 99%
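The intermediate duration step mentioned in that statement is essentially the FastSpeech-style length regulator. A minimal sketch under assumed names: a predicted integer duration per phoneme expands the phoneme-level encoder states into a frame-level sequence, so the decoder can run in parallel without relying on learned attention alignment.

```python
import torch


def length_regulate(phoneme_states: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand (num_phonemes, d_model) states to (total_frames, d_model) by repetition."""
    return torch.repeat_interleave(phoneme_states, durations, dim=0)


phoneme_states = torch.randn(5, 256)           # dummy encoder outputs for 5 phonemes
durations = torch.tensor([3, 7, 2, 5, 4])      # predicted frames per phoneme
frame_states = length_regulate(phoneme_states, durations)
print(frame_states.shape)                      # torch.Size([21, 256])
```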