Statistical Parametric Speech Synthesis

Black, Zen, Tokuda

doi:10.1109/icassp.2007.367298

Cited by 209 publications

(101 citation statements)

References 92 publications

“…The HMM-based speech synthesis system (HTS) [1] models spectrum, F0 and duration simultaneously in the unified framework of HSMM. In the training stage, the output vector of the HSMM consists of a spectrum part and an F0 part.…”

Section: Statistical Speech Synthesis (mentioning)

confidence: 99%

“…Recent advances in the field of statistical speech synthesis [1], have considerably reduced the gap between basic techniques used in automatic speech recognition (ASR) and text to speech (TTS). Feature types, feature dimensionality, duration and pitch modeling are a few of the key differences between the recognition and synthesis models [2].…”

Section: Introduction (mentioning)

confidence: 99%

VTLN adaptation for statistical speech synthesis

Saheer, Garner, Dines et al. 2010

2010 IEEE International Conference on Acoustics, Speech and Signal Processing

The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) for synthesis is explored in this paper. VTLN based adaptation requires estimation of a single warping factor, which can be accurately estimated from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors using higher order features is solved by initializing warping factor estimation with the values calculated from lower order features.
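As a rough illustration of what a single warping factor does, the sketch below assumes the bilinear (all-pass) frequency warping that is commonly used for VTLN. The abstract itself does not specify the warping function or the estimation procedure, so `bilinear_warp`, `pick_warping_factor`, and the toy score function are hypothetical names chosen for this example only.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency omega (radians, 0..pi).

    Assumes the bilinear (all-pass) transform with warping factor alpha,
    where |alpha| < 1. alpha = 0 leaves the frequency axis unchanged.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def pick_warping_factor(score, candidates):
    """Return the candidate alpha with the highest score.

    `score` is a hypothetical callable standing in for whatever criterion
    (for example, a likelihood of the adaptation data) is being maximized.
    """
    return max(candidates, key=score)


if __name__ == "__main__":
    # Positive alpha pushes mid-band frequencies upward; endpoints stay fixed.
    for alpha in (-0.1, 0.0, 0.1):
        print(alpha, round(bilinear_warp(math.pi / 2, alpha), 4))

    # Toy score that peaks at alpha = 0.1, for illustration only.
    best = pick_warping_factor(lambda a: -(a - 0.1) ** 2,
                               [-0.1, 0.0, 0.1, 0.2])
    print("best alpha:", best)
```

Because only one scalar has to be chosen, a coarse search over a handful of candidates is enough to illustrate the idea; a real system would score each candidate against its acoustic model.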

“…Statistical parametric speech synthesis (SPSS) has dominated speech synthesis research area over the last decade [1,2]. It is mainly due to SPSS advantages over traditional concatenative speech synthesis approaches; these advantages include the flexibility to change voice characteristics [3][4][5], multilingual support [6][7][8], coverage of acoustic space [1], small footprint [1], and robustness [4,9].…”

Section: Introduction (mentioning)

confidence: 99%

“…Every SPSS system consists of two distinct phases, namely training and synthesis [1,2]. In the training phase, first acoustic and contextual factors are extracted for the whole training database using a vocoder [12,29,30] and a natural language pre-processor.…”

Section: Introduction (mentioning)

confidence: 99%

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis

Khorram, Sameti, Bahmaninezhad et al. 2014

EURASIP Journal on Audio, Speech, and Music Processing

Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) The decision tree structure lacks adequate context generalization. (ii) It is unable to express complex context dependencies. (iii) Parameters generated from this structure represent sudden transitions between adjacent states. In order to alleviate the above limitations, many former papers applied multiple decision trees with an additive assumption over those trees. Similarly, the current study uses multiple decision trees as well, but instead of the additive assumption, it is proposed to train the smoothest distribution by maximizing entropy measure. Obviously, increasing the smoothness of the distribution improves the context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences). 
In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.

“…By using parameter generation algorithm [2], spectral and excitation parameters are generated from the sentence HMM. Finally, by using a synthesis filter, speech is synthesized from the generated spectral and excitation parameters [7], [16] and [17]. Spectral and excitation parameters are needed for any synthesis filter to generate speech waveforms so both must be modeled by HMMs.…”

Section: Introduction (mentioning)

confidence: 99%

The Effect of Speech Features and HMM Parameters on the Quality of HMM Based Arabic Synthesis System

Barakat, Gadallah 2010

IJCEE

A statistical parametric speech synthesis system based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach the system simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. In this paper, the HMM-based speech synthesis system is applied to Arabic language using low size unsegmented speech training database. This technique shows that the resulting HMM set has the advantage of being small (can be less than 1MB) which is very important for communication applications. The basic contribution in this paper is to justify both the HMM parameters and the speech features to be suitable for using small speech database to get the highest quality. The motivation of this work is the starvation of the Arabic speech database. Experiments show that using Mel-cepstral coefficients as spectral parameters of speech waveforms for training gives better results than using LPC or PARCOR coefficients. Also, investigation tests show that increasing the context-dependent models length and the number of Gaussian Mixtures with this relatively small size training data has the disadvantage of poor generalization of HMMs that leads to perceivable discontinuities and clicks in the synthesized speech.
