2014
DOI: 10.1587/transinf.e97.d.1438

Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

Abstract: This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients fo…

Cited by 4 publications (3 citation statements)
References 21 publications

“…Prior to the recent use of neural networks in TTS, some authors tried to implement the end-to-end concept, sometimes regarded as waveform models [21], [22].…”
Section: A End-to-end Methods (mentioning)
confidence: 99%
“…(14). By back-propagating the derivative of the log likelihood function through the network, the network weights can be updated to maximize the log likelihood.…”
Section: By Assuming (mentioning)
confidence: 99%
“…the log spectral distortion-version of minimum generation error training (MGE-LSD) [11], statistical vocoder (STAVOCO) [12], waveform-level statistical model [13], and mel-cepstral analysis-integrated hidden Markov models (HMMs) [14]. However, there are limitations in these approaches, such as the use of spectra rather than waveforms, the use of overlapping and shifting frames as unit, and fixing decision trees [15], which represent the mapping from linguistic features to acoustic ones [16].…”
Section: Introduction (mentioning)
confidence: 99%