ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414870

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis

Abstract: Sequence-to-sequence (seq2seq) learning has greatly improved text-to-speech (TTS) synthesis performance, but effective implementation on resource-restricted devices remains challenging, as seq2seq models are usually computationally expensive and memory intensive. To achieve fast inference speed and small model size while maintaining high-quality speech, we propose FCL-taco2, a Fast, Controllable and Lightweight (FCL) TTS model based on Tacotron2. FCL-taco2 adopts a novel semi-autoregressive (SAR) mode for phoneme …

Cited by 9 publications (4 citation statements)
References 19 publications (26 reference statements)
“…Extending to other prosodic features, FastPitch [21] incorporates pitch control by also predicting F0 contours and FastPitchFormant [22] utilizes the predicted F0 in an excitation generator inspired by the source-filter theory in order to provide more robust and accurate pitch control. Since TTS decoders are conditioned on phoneme encoder representations, in FastSpeech 2 [23] and FCL-Taco2 [24] prosody prediction modules are introduced, which add prosodic information to these representations and are trained in a supervised manner utilizing ground truth values. In these cases, prosody information can be represented in various ways.…”
Section: Related Work
Mentioning confidence: 99%
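To make the supervised prosody-prediction idea in the quote above concrete, here is a minimal sketch in the spirit of the FastSpeech 2 / FCL-Taco2 variance adaptors: a small convolutional module maps phoneme-level encoder representations to one scalar prosodic value per phoneme (e.g. pitch or energy), is trained against ground-truth values, and adds an embedding of that value back onto the encoder outputs that condition the decoder. All names, shapes, and hyperparameters below are assumptions for illustration, not the papers' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProsodyPredictor(nn.Module):
    """Hypothetical per-phoneme prosody predictor (pitch, energy, or duration)."""

    def __init__(self, d_model: int = 256, d_hidden: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(d_hidden, 1)      # one scalar per phoneme
        self.embed = nn.Linear(1, d_model)      # re-embed the value into the encoder space

    def forward(self, enc_out, target=None):
        # enc_out: (batch, num_phonemes, d_model); target: (batch, num_phonemes) or None
        h = torch.relu(self.conv1(enc_out.transpose(1, 2)))
        h = torch.relu(self.conv2(h)).transpose(1, 2)
        pred = self.proj(h).squeeze(-1)
        # Supervised training against ground-truth prosody; predictions are used at inference.
        loss = None if target is None else F.mse_loss(pred, target)
        value = pred if target is None else target
        conditioned = enc_out + self.embed(value.unsqueeze(-1))
        return conditioned, pred, loss


enc_out = torch.randn(2, 17, 256)     # dummy phoneme encoder outputs
gt_pitch = torch.randn(2, 17)         # dummy ground-truth per-phoneme pitch
conditioned, pred, loss = ProsodyPredictor()(enc_out, gt_pitch)
```

Because the predicted value is fed back into the encoder representations, replacing or scaling it at inference time gives the kind of explicit prosody control the cited works describe.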
“…It speeds up the Parallel WaveGAN vocoder by 4x without any degradation in sound quality. Wang et al. [367] propose a semi-autoregressive mode for mel-spectrogram generation, where the mel-spectrograms are generated in an autoregressive mode within each individual phoneme and in a non-autoregressive mode across different phonemes.…”
Section: Adaptive
Mentioning confidence: 99%
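As a rough illustration of that semi-autoregressive scheme (a sketch under assumed names and shapes, not the FCL-Taco2 implementation): frames are generated autoregressively inside each phoneme's segment, while the segments for different phonemes depend only on their own phoneme state and could therefore be decoded in parallel.

```python
import torch
import torch.nn as nn


class SemiAutoregressiveDecoder(nn.Module):
    """Sketch: autoregressive within a phoneme, non-autoregressive across phonemes."""

    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRUCell(d_model + n_mels, d_model)
        self.proj = nn.Linear(d_model, n_mels)
        self.n_mels = n_mels

    def decode_phoneme(self, phoneme_vec, n_frames):
        # Autoregressive loop over the frames belonging to ONE phoneme.
        h = torch.zeros(phoneme_vec.size(0), phoneme_vec.size(-1))
        prev = torch.zeros(phoneme_vec.size(0), self.n_mels)    # "go" frame
        frames = []
        for _ in range(n_frames):
            h = self.rnn(torch.cat([phoneme_vec, prev], dim=-1), h)
            prev = self.proj(h)
            frames.append(prev)
        return torch.stack(frames, dim=1)                        # (1, n_frames, n_mels)

    def forward(self, phoneme_states, durations):
        # phoneme_states: (num_phonemes, d_model); durations: frames per phoneme.
        # Each segment depends only on its own phoneme, so this loop is parallelizable.
        segments = [
            self.decode_phoneme(phoneme_states[i : i + 1], int(durations[i]))
            for i in range(phoneme_states.size(0))
        ]
        return torch.cat(segments, dim=1)                        # (1, total_frames, n_mels)


mel = SemiAutoregressiveDecoder()(torch.randn(5, 256), [3, 7, 2, 5, 4])
print(mel.shape)   # torch.Size([1, 21, 80])
```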
“…Speedup with Domain Knowledge: Domain knowledge from speech can be leveraged to speed up inference, such as linear prediction [357], multiband modeling [411,400], subscale prediction [147], multi-frame prediction [420,376,367,123], streaming synthesis [74], etc. LPCNet [357] combines digital signal processing with neural networks, by using linear prediction coefficients to calculate the next waveform sample and a lightweight model to predict the residual value, which speeds up the inference of autoregressive waveform generation.…”
Section: Adaptive
Mentioning confidence: 99%
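A tiny numerical sketch of the linear-prediction idea behind that passage (not the LPCNet implementation; the function names and toy signal are made up): per-frame LPC coefficients estimated from the autocorrelation normal equations predict the next sample from its recent past, so a neural model only has to output the comparatively small residual (excitation).

```python
import numpy as np


def lpc_coefficients(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """Estimate LPC coefficients from the autocorrelation normal equations."""
    autocorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[autocorr[abs(i - j)] for j in range(order)] for i in range(order)])
    r = autocorr[1:order + 1]
    return np.linalg.solve(R + 1e-6 * np.eye(order), r)


def lpc_predict(history: np.ndarray, coeffs: np.ndarray) -> float:
    """Predict the next sample as a weighted sum of the previous `order` samples."""
    order = len(coeffs)
    return float(np.dot(coeffs, history[-1:-order - 1:-1]))


# Toy demo: the residual left for a neural model to predict is much smaller
# than the raw sample value, which is what keeps the network lightweight.
rng = np.random.default_rng(0)
t = np.arange(401) / 16000.0
signal = np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(len(t))
coeffs = lpc_coefficients(signal[:400])
prediction = lpc_predict(signal[:400], coeffs)
residual = signal[400] - prediction
print(f"sample={signal[400]:+.4f}  lpc_prediction={prediction:+.4f}  residual={residual:+.4f}")
```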
“…The trick non-autoregressive models use to generate Mel frames in parallel is to predict the relevant features as an intermediate step and condition the independent decoding of Mels on them. This technique is now increasingly adopted for autoregressive models as well (Wang et al., 2021) to predict features like phoneme duration that improve decoding stability, avoiding alignment issues. Our method is compatible with any architecture that predicts prosodic features of pitch, energy, and duration as an intermediate step before decoding.…”
Section: Related Work
Mentioning confidence: 99%
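The intermediate duration step mentioned in that statement is essentially the FastSpeech-style length regulator. A minimal sketch under assumed names: a predicted integer duration per phoneme expands the phoneme-level encoder states into a frame-level sequence, so the decoder can run in parallel without relying on learned attention alignment.

```python
import torch


def length_regulate(phoneme_states: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand (num_phonemes, d_model) states to (total_frames, d_model) by repetition."""
    return torch.repeat_interleave(phoneme_states, durations, dim=0)


phoneme_states = torch.randn(5, 256)           # dummy encoder outputs for 5 phonemes
durations = torch.tensor([3, 7, 2, 5, 4])      # predicted frames per phoneme
frame_states = length_regulate(phoneme_states, durations)
print(frame_states.shape)                      # torch.Size([21, 256])
```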