FastPitch: Parallel Text-to-Speech with Pitch Prediction

Lancucki, Adrian

doi:10.1109/icassp39728.2021.9413889

Cited by 112 publications

(76 citation statements)

References 11 publications

(15 reference statements)


“…We prepared the following GAN-based vocoders for comparison: MelGAN, Parallel WaveGAN, and HiFi-GAN. We used the official implementations of MelGAN 3 and HiFi-GAN V1 4 for reproducibility, and we implemented Parallel WaveGAN as per the process followed by Yamamoto et al. [12]. The learning rate, optimizer, and all other parameters required for training followed the reference configurations of each model.…”

Section: Model Details (mentioning)

confidence: 99%

“…For text-to-speech evaluation, we used the JDI-T [2] acoustic model with a pitch and energy predictor [3,4]. We converted text to phoneme sequences using open-sourced software 5.…”

Section: Model Details (mentioning)

confidence: 99%

“…Vocoders have been employed in various fields such as text-to-speech [1,2,3,4], voice conversion [5], and speech-to-speech translation [6]. Neural vocoders based on deep neural networks can generate human-like voices, instead of using traditional methods that contain audible artifacts [7,8,9].…”

Section: Introduction (mentioning)

confidence: 99%


UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Jang, Lim, Yoon et al. 2021

Preprint


Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multiresolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.
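The abstract above describes a discriminator fed with linear spectrogram magnitudes computed under several parameter sets. A minimal NumPy sketch of that input computation follows. The function names and the three (FFT size, hop, window length) triples are illustrative assumptions and are not taken from the UnivNet paper; the discriminator itself is not shown.

```python
import numpy as np

# Hypothetical parameter sets: (n_fft, hop, win_length). Not the paper's values.
ILLUSTRATIVE_RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]


def stft_magnitude(signal, n_fft, hop, win_length):
    """Linear magnitude spectrogram, shape (n_frames, n_fft // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < win_length:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    # Slice overlapping frames and apply the analysis window to each one.
    frames = np.stack(
        [signal[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    # rfft zero-pads each frame to n_fft when win_length < n_fft.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def multi_resolution_spectrograms(signal, resolutions=ILLUSTRATIVE_RESOLUTIONS):
    """One magnitude spectrogram per parameter set, in the order given."""
    return [stft_magnitude(signal, n_fft, hop, win) for n_fft, hop, win in resolutions]
```

Short windows give finer time resolution and long windows give finer frequency resolution, which is the usual motivation for looking at the same waveform at several resolutions.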


“…This modification allows the variance adapter to process all information at the phoneme level. Additionally, prior work has shown that predicting variances at the phoneme level rather than at the frame level improves speech quality [33]. Second, we introduced speaker and emotion encoders to add the variance in speaker ID and emotion.…”

Section: Model Structure (mentioning)

confidence: 99%

UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control

Kang, Kim, Kim 2021

Preprint


We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion, and additionally, predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results exhibit that the proposed methods learned multiple attributes harmoniously in a manner that can be easily separated again. As well, UniTTS synthesized high-fidelity speech signals controlling multiple style attributes. The synthesized speech samples are presented at https://jackson-kang.github.io/paper_works/UniTTS/demos. Preprint. Under review.
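The citation statement for this entry refers to predicting variances such as pitch at the phoneme level rather than the frame level. One common way to obtain phoneme-level pitch targets is to average frame-level F0 over the frames aligned to each phoneme. The sketch below illustrates that idea only; the function name, the convention that 0 marks an unvoiced frame, and the choice to ignore unvoiced frames when averaging are assumptions, not details confirmed by either paper.

```python
import numpy as np


def average_pitch_per_phoneme(frame_f0, durations):
    """Collapse frame-level F0 into one value per phoneme.

    frame_f0:  1-D sequence of F0 values, with 0 marking unvoiced frames
               (an assumed convention).
    durations: number of frames aligned to each phoneme; must sum to
               len(frame_f0).
    """
    frame_f0 = np.asarray(frame_f0, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if durations.sum() != frame_f0.shape[0]:
        raise ValueError("durations must sum to the number of F0 frames")

    phoneme_f0 = np.zeros(len(durations))
    start = 0
    for i, d in enumerate(durations):
        segment = frame_f0[start : start + d]
        voiced = segment[segment > 0]
        # Average voiced frames only; a fully unvoiced phoneme stays at 0.
        if voiced.size:
            phoneme_f0[i] = voiced.mean()
        start += d
    return phoneme_f0
```

For example, frames `[100, 120, 0, 0, 200, 220, 240]` with durations `[2, 2, 3]` reduce to three phoneme-level values, with the fully unvoiced middle phoneme left at 0.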


“…Despite the advantages, end-to-end attention within autoregressive models have limitations such as slow inference speed, word skipping, and reading [11,13]. As a means of remedy to this problem, non-autoregressive models are proposed for parallel generation of mel-spectrograms from text or phoneme [13][14][15][16][17][18][19]. Although the new architecture alleviates some of the drawbacks from autoregressive models, the duration aligner of non-autoregressive models still require guidance from external aligners.…”

Section: Introduction (mentioning)

confidence: 99%

Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Chung, Lee 2021

Interspeech 2021


Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. As the complexity increases and efficiency decreases with every additional step, there is expanding demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement learning based duration search method. Our proposed generator is feed-forward and the aligner trains the agent to make optimal duration predictions by receiving active feedback from actions taken to maximize cumulative reward. We demonstrate accurate alignments of phoneme-to-frame sequence generated from trained agents enhance fidelity and naturalness of synthesized audio. Experimental results also show the superiority of our proposed model compared to other state-of-the-art TTS models with internal and external aligners.
