2015
DOI: 10.1007/978-3-319-25789-1_4

Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

Cited by 12 publications (16 citation statements)
References 20 publications
“…During the synthesis phase, voiced excitation is composed of residual excitation frames overlap-added pitch synchronously, depending on the continuous F0 [28,29,30]. After that, this voiced excitation is lowpass filtered frame by frame at the frequency given by the MVF parameter.…”
Section: Continuous Vocoder (mentioning)
confidence: 99%
“…The RMSE of the Maximum Voiced Frequency prediction is in the range of 654-1177 Hz, indicating that for some speakers the MVF can be estimated with lower error, while for Female #1 this task was more difficult. Comparing the UTI-to-MVF prediction results with text-to-MVF prediction (within HMM-based speech synthesis [28,29]), the latter seems to be a simpler task. Predicting voicing related parameters from articulatory data can be done only through indirect relationships, as pointed out in Section 1.…”
Section: Objective Evaluation (mentioning)
confidence: 99%
“…20 sentences from each speaker were chosen randomly to be analyzed and synthesized with the baseline and proposed vocoders. These 60 utterances were subsequently down-sampled by a factor of 2 in order to reduce its sampling rate from 32 kHz to 16 kHz, as this is a more typical use in the early baseline vocoder [13].…”
Section: Datasets (mentioning)
confidence: 99%
“…Tóth and Csapó [12] have shown that continuous F0 contour can be better approximated with HMM and deep neural network (DNN) than traditional discontinuous F0. In [13], an excitation model has been proposed which combines continuous F0 modeling with Maximum Voiced Frequency (MVF). This model has been shown to produce more natural synthesized speech for voiced sounds than traditional vocoders based on standard pitch tracking, whereas it was also found that there is a room for improvement in modeling unvoiced sounds.…”
Section: Introduction (mentioning)
confidence: 99%
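
The synthesis step quoted in the first citation statement (residual frames overlap-added pitch synchronously according to the continuous F0, then lowpass filtered frame by frame at the MVF) can be illustrated with a minimal sketch. This is not the cited authors' implementation. The function names, the 16 kHz sampling rate, the 80-sample hop, the single reused residual frame, and the crude FFT brick-wall filter are all assumptions chosen for illustration.

```python
import numpy as np

def voiced_excitation(residual, f0, fs=16000, hop=80):
    """Overlap-add one residual frame at pitch-synchronous positions.

    residual: 1-D array holding a single residual frame (hypothetical input)
    f0: per-frame continuous F0 in Hz, assumed > 0 everywhere
    """
    n_samples = len(f0) * hop
    out = np.zeros(n_samples + len(residual))
    pos = 0.0
    while pos < n_samples:
        start = int(pos)
        frame_idx = min(start // hop, len(f0) - 1)
        # Place the residual frame starting at the current pitch mark.
        out[start:start + len(residual)] += residual
        # Advance by one pitch period, in samples.
        pos += fs / f0[frame_idx]
    return out[:n_samples]

def lowpass_per_frame(signal, mvf, fs=16000, hop=80):
    """Zero all spectral bins above each frame's MVF (assumed brick-wall filter)."""
    out = np.zeros_like(signal)
    for i, cutoff in enumerate(mvf):
        frame = signal[i * hop:(i + 1) * hop]
        spec = np.fft.rfft(frame)
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        spec[freqs > cutoff] = 0.0
        out[i * hop:(i + 1) * hop] = np.fft.irfft(spec, n=len(frame))
    return out

# Usage: 10 frames at a constant 100 Hz, MVF fixed at 4 kHz.
f0 = np.full(10, 100.0)
mvf = np.full(10, 4000.0)
excitation = voiced_excitation(np.hanning(64), f0)
filtered = lowpass_per_frame(excitation, mvf)
```

Filtering non-overlapping blocks independently introduces discontinuities at frame boundaries, so a practical vocoder would more likely use overlapping windows or a time-varying filter. The sketch only shows the order of the two operations.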