Interspeech 2021
DOI: 10.21437/interspeech.2021-866

FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis

Abstract: Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation and speaker characteristic deformation. To address this problem, we propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, has a unique…
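As context for the source-filter decomposition the abstract refers to, here is a minimal sketch (not the authors' code) of the general idea: one branch models the excitation (source), driven by the pitch contour, while a parallel branch models the formant structure (filter), driven by the text representations, and the two are combined before projecting to a mel-spectrogram. The layer counts, hidden size, and combination by summation are placeholder assumptions.

```python
# Sketch of a source-filter style decomposed decoder (assumed structure).
import torch
import torch.nn as nn

class SourceFilterDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 256, n_mels: int = 80):
        super().__init__()
        # Excitation branch: conditioned on text hidden states plus a pitch embedding.
        self.excitation = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=2, batch_first=True),
            num_layers=2,
        )
        # Formant branch: conditioned on text hidden states only.
        self.formant = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=2, batch_first=True),
            num_layers=2,
        )
        self.pitch_embed = nn.Conv1d(1, hidden_dim, kernel_size=3, padding=1)
        self.to_mel = nn.Linear(hidden_dim, n_mels)

    def forward(self, text_hidden: torch.Tensor, f0: torch.Tensor) -> torch.Tensor:
        # text_hidden: (B, T, H) length-regulated phoneme representations
        # f0:          (B, T)    frame-level pitch contour
        pitch = self.pitch_embed(f0.unsqueeze(1)).transpose(1, 2)  # (B, T, H)
        source = self.excitation(text_hidden + pitch)              # pitch-driven branch
        filt = self.formant(text_hidden)                           # text-driven branch
        return self.to_mel(source + filt)                          # combine and project
```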

Cited by 6 publications (3 citation statements)
References 19 publications (24 reference statements)
“…Extending to other prosodic features, FastPitch [21] incorporates pitch control by also predicting F0 contours and FastPitchFormant [22] utilizes the predicted F0 in an excitation generator inspired by the source-filter theory in order to provide more robust and accurate pitch control. Since TTS decoders are conditioned on phoneme encoder representations, in FastSpeech 2 [23] and FCL-Taco2 [24] prosody prediction modules are introduced, which add prosodic information to these representations and are trained in a supervised manner utilizing ground truth values.…”
Section: Related Work
confidence: 99%
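The statement above summarizes FastPitch-style pitch control: a small predictor estimates an F0 value per encoder position, and the predicted (or externally scaled) pitch is embedded and added back to the encoder output before decoding. The sketch below is a hedged illustration of that recipe; the module sizes and the simple multiplicative pitch-shift control are illustrative assumptions, not the published configuration.

```python
# Assumed FastPitch-style pitch predictor and conditioning (illustrative only).
import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=3, padding=1),
        )
        self.embed = nn.Conv1d(1, hidden_dim, kernel_size=3, padding=1)

    def forward(self, enc: torch.Tensor, pitch_scale: float = 1.0):
        # enc: (B, T, H) phoneme encoder output
        x = enc.transpose(1, 2)                 # (B, H, T)
        f0_pred = self.net(x)                   # (B, 1, T) predicted pitch
        f0_ctrl = f0_pred * pitch_scale         # optional pitch-shift control
        conditioned = enc + self.embed(f0_ctrl).transpose(1, 2)
        return conditioned, f0_pred.squeeze(1)
```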
“…To evaluate the responsiveness of the models to the variance information, we measure f0 frame error rate (FFE) [21] between the pitch values provided to the decoder and the pitch values extracted from the generated audio samples. All audio samples are generated by adjusting the pitch values in a semitone unit as follows [10]:…”
Section: Pitch Responsiveness
confidence: 99%
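The evaluation described above can be sketched as follows: the pitch contour is shifted by a whole number of semitones (f0 · 2^(n/12)) before synthesis, and the F0 Frame Error (FFE) is computed between the target contour and the contour extracted from the generated audio. FFE here follows the common definition (voicing decision errors plus gross pitch errors over all frames); the 20% gross-pitch-error tolerance is the usual setting and may differ from the cited papers' exact configuration.

```python
# Hedged sketch of semitone pitch shifting and F0 Frame Error (FFE) computation.
import numpy as np

def shift_semitones(f0: np.ndarray, n_semitones: int) -> np.ndarray:
    """Scale an F0 contour (Hz, 0 for unvoiced frames) by n semitones."""
    return f0 * 2.0 ** (n_semitones / 12.0)

def f0_frame_error(ref_f0: np.ndarray, gen_f0: np.ndarray, tol: float = 0.2) -> float:
    """Fraction of frames with a voicing decision error or a >tol relative pitch error."""
    ref_voiced, gen_voiced = ref_f0 > 0, gen_f0 > 0
    voicing_err = ref_voiced != gen_voiced
    both_voiced = ref_voiced & gen_voiced
    pitch_err = np.zeros_like(voicing_err)
    pitch_err[both_voiced] = (
        np.abs(gen_f0[both_voiced] - ref_f0[both_voiced]) / ref_f0[both_voiced] > tol
    )
    return float(np.mean(voicing_err | pitch_err))
```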
“…As a result, they can generate high-quality and diverse speech samples compared to the previous non-AR TTS models trained with the MSE loss [6,7]. The second type of method (Type-II) solves the TTS task by dividing it into two simpler tasks based on variance information such as pitch or energy [8,9,10]: (1) text-conditioned variance modeling; (2) text- and variance-conditioned speech generation. Then, by using ground-truth variance information when learning the two tasks, Type-II models achieve faster training convergence and higher speech quality.…”
Section: Introduction
confidence: 99%
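The Type-II split described above can be sketched as a two-part training objective: predict the variance information (e.g. pitch) from text, and generate speech conditioned on text plus variance, feeding the ground-truth variance to the decoder during training and the predicted variance at inference. This is an assumption about the general recipe, not any specific paper's code; the encoder, predictor, and decoder are hypothetical callables.

```python
# Generic Type-II training/inference sketch (assumed recipe, illustrative modules).
import torch
import torch.nn as nn

def type_ii_step(encoder, variance_predictor, decoder,
                 text, gt_variance, gt_mel, mse=nn.MSELoss()):
    enc = encoder(text)                           # text representations
    var_pred = variance_predictor(enc)            # task 1: text -> variance
    var_loss = mse(var_pred, gt_variance)
    mel_pred = decoder(enc, gt_variance)          # task 2: condition on ground truth
    mel_loss = mse(mel_pred, gt_mel)
    return var_loss + mel_loss

@torch.no_grad()
def type_ii_infer(encoder, variance_predictor, decoder, text):
    enc = encoder(text)
    return decoder(enc, variance_predictor(enc))  # predicted variance at inference
```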