ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413889

FastPitch: Parallel Text-to-Speech with Pitch Prediction

Abstract: We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and in the end be more engaging to the listener. Uniformly increasing or decreasing pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves th…
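To make the idea of uniformly raising or lowering a predicted pitch contour concrete, here is a minimal sketch. It is not FastPitch's code or API: the function name `shift_pitch`, the representation of the contour as a list of F0 values in Hz, the use of 0.0 to mark unvoiced frames, and the choice of a semitone-based shift are all assumptions made for illustration.

```python
def shift_pitch(f0_hz, semitones):
    """Shift every voiced frame of an F0 contour by a fixed number of semitones.

    Assumption: frames with value 0.0 are unvoiced and are left unchanged.
    """
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


# Hypothetical contour: unvoiced, 110 Hz, 220 Hz, unvoiced.
contour = [0.0, 110.0, 220.0, 0.0]
raised = shift_pitch(contour, 12)  # shift up by one octave
print(raised)
```

In a model that predicts pitch before synthesizing the spectrogram, a transformation of this kind would be applied to the predicted contour, so the rest of the pipeline stays unchanged.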

Citations: cited by 112 publications (76 citation statements)
References: references 11 publications (15 reference statements)
“…We prepared the following GAN-based vocoders for comparison: MelGAN, Parallel WaveGAN, and HiFi-GAN. We used the official implementations of MelGAN 3 and HiFi-GAN V1 4 for reproducibility, and we implemented Parallel WaveGAN as per the process followed by Yamamoto et al. [12]. The learning rate, optimizer, and all other parameters required for training followed the reference configurations of each model.…”
Section: Model Details (mentioning)
confidence: 99%
“…For text-to-speech evaluation, we used the JDI-T [2] acoustic model with a pitch and energy predictor [3,4]. We converted text to phoneme sequences using open-sourced software 5 .…”
Section: Model Details (mentioning)
confidence: 99%
“…This modification allows the variance adapter to process all information at the phoneme level. Additionally, prior work has shown that predicting variances at the phoneme level rather than at the frame level improves speech quality [33]. Second, we introduced speaker and emotion encoders to add the variance in speaker ID and emotion.…”
Section: Model Structure (mentioning)
confidence: 99%
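The statement above contrasts frame-level and phoneme-level variance prediction. One common way to obtain phoneme-level pitch targets is to average the frame-level F0 over each phoneme's duration; the sketch below illustrates that idea only. The function name `average_pitch_per_phoneme`, the convention that 0.0 marks an unvoiced frame, and the example values are assumptions, not details taken from the cited work.

```python
def average_pitch_per_phoneme(frame_f0, durations):
    """Collapse a frame-level F0 contour to one value per phoneme.

    frame_f0:  F0 per frame in Hz, with 0.0 assumed to mean unvoiced.
    durations: number of frames assigned to each phoneme, in order.
    """
    if sum(durations) != len(frame_f0):
        raise ValueError("durations must sum to the number of frames")

    phoneme_f0 = []
    start = 0
    for d in durations:
        # Average only the voiced frames inside this phoneme's span.
        voiced = [f for f in frame_f0[start:start + d] if f > 0.0]
        phoneme_f0.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += d
    return phoneme_f0


# Hypothetical example: 5 frames split across 3 phonemes.
print(average_pitch_per_phoneme([100.0, 120.0, 0.0, 0.0, 200.0], [2, 2, 1]))
```

A phoneme with no voiced frames keeps the value 0.0 here; other conventions, such as interpolating across unvoiced regions first, are equally plausible.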
“…Despite the advantages, end-to-end attention within autoregressive models has limitations such as slow inference speed, word skipping, and reading [11,13]. As a means of remedy to this problem, non-autoregressive models are proposed for parallel generation of mel-spectrograms from text or phoneme [13–19]. Although the new architecture alleviates some of the drawbacks from autoregressive models, the duration aligner of non-autoregressive models still requires guidance from external aligners.…”
Section: Introduction (mentioning)
confidence: 99%