Speech parameter generation algorithms for HMM-based speech synthesis

Tokuda, Keiichi; Yoshimura, Takeshi; Masuko, Takashi; Kobayashi, Takao; Kitamura, Tadashi

doi:10.1109/icassp.2000.861820

Cited by 723 publications (555 citation statements)

References 11 publications

Supporting

Mentioning

538

Contrasting

Unclassified

“…Thus it introduces some level of discontinuity. To obtain a smooth trajectory of spectral vectors Maximum Likelihood Parameter Generation (MLPG) [10] is used.…”

Section: Introduction (mentioning)

confidence: 99%

Voice conversion using Artificial Neural Networks

Desai, Raghavendra, Yegnanarayana, et al. 2009

2009 IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper, we propose to use Artificial Neural Networks (ANN) for voice conversion. We have exploited the mapping abilities of ANN to perform mapping of spectral features of a source speaker to that of a target speaker. A comparative study of voice conversion using ANN and the state-of-the-art Gaussian Mixture Model (GMM) is conducted. The results of voice conversion evaluated using subjective and objective measures confirm that ANNs perform better transformation than GMMs and the quality of the transformed speech is intelligible and has the characteristics of the target speaker.

Index Terms: Voice conversion, Artificial Neural Networks, Gaussian Mixture Model.

“…⊤ denotes the joint static and dynamic feature sequence, W is a transform matrix to extend the static feature sequence into the static and dynamic feature sequence [15]. To avoid the complicated formula ∑ m in Eq.…”

Section: Batch-type Prediction Process (mentioning)

confidence: 99%

A Vibration Control Method of an Electrolarynx Based on Statistical F0 Pattern Prediction

Tanaka, Toda, Nakamura 2017

IEICE Trans. Inf. & Syst.

SUMMARY This paper presents a novel speaking aid system to help laryngectomees produce more naturally sounding electrolaryngeal (EL) speech. An electrolarynx is an external device to generate excitation signals, instead of vibration of the vocal folds. Although the conventional EL speech is quite intelligible, its naturalness suffers from the unnatural fundamental frequency (F0) patterns of the mechanically generated excitation signals. To improve the naturalness of EL speech, we have proposed EL speech enhancement methods using statistical F0 pattern prediction. In these methods, the original EL speech recorded by a microphone is presented from a loudspeaker after performing the speech enhancement. These methods are effective for some situation, such as telecommunication, but it is not suitable for face-to-face conversation because not only the enhanced EL speech but also the original EL speech is presented to listeners. In this paper, to develop an EL speech enhancement also effective for face-to-face conversation, we propose a method for directly controlling F0 patterns of the excitation signals to be generated from the electrolarynx using the statistical F0 prediction. To get an "actual feel" of the proposed system, we also implement a prototype system. By using the prototype system, we find latency issues caused by a real-time processing. To address these latency issues, we furthermore propose segmental continuous F0 pattern modeling and forthcoming F0 pattern modeling. With evaluations through simulation, we demonstrate that our proposed system is capable of effectively addressing the issues of latency and those of electrolarynx in term of the naturalness.
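
The citation statement quoted for this entry describes W as the matrix that expands a static feature sequence into the joint static and dynamic feature sequence. As a rough illustration of how such a matrix is typically used in maximum likelihood parameter generation, the Python sketch below solves the normal equations (W^T S^-1 W) c = W^T S^-1 mu for a one-dimensional feature, where S is the covariance and mu the mean of the joint sequence. The function names, the delta window (-0.5, 0, 0.5), the edge clamping at utterance boundaries, and the restriction to diagonal covariances are assumptions made for this example, not details taken from the cited papers.

```python
import numpy as np


def build_window_matrix(num_frames, delta_win=(-0.5, 0.0, 0.5)):
    """Build W of shape (2T, T): row 2t is the static row for frame t,
    row 2t+1 is its delta row. The delta window is an assumed choice."""
    T = num_frames
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static feature: identity
        for offset, coef in zip((-1, 0, 1), delta_win):
            # Assumption: clamp neighbours at the edges so that a constant
            # trajectory has zero delta everywhere.
            tau = min(max(t + offset, 0), T - 1)
            W[2 * t + 1, tau] += coef
    return W


def mlpg(means, variances):
    """Return the static trajectory c that maximises the likelihood of W c.

    means, variances: arrays of shape (T, 2) with columns [static, delta].
    Diagonal covariances are assumed, so S^-1 is a vector of precisions.
    """
    T = means.shape[0]
    W = build_window_matrix(T)
    mu = means.reshape(-1)                 # interleaved [static, delta] per frame
    precision = 1.0 / variances.reshape(-1)
    WtP = W.T * precision                  # W^T S^-1 via column scaling
    A = WtP @ W                            # W^T S^-1 W, shape (T, T)
    b = WtP @ mu                           # W^T S^-1 mu, shape (T,)
    return np.linalg.solve(A, b)


if __name__ == "__main__":
    # A step in the static means with zero-mean, low-variance deltas:
    # the generated trajectory trades off both constraints.
    static_mean = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    means = np.column_stack([static_mean, np.zeros(6)])
    variances = np.column_stack([np.ones(6), np.full(6, 0.1)])
    print(mlpg(means, variances))
```

The dense solve keeps the sketch short; because W^T S^-1 W is banded, a practical implementation would more likely use a banded or Cholesky solver.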

“…On the other hand, the frame spectral feature (i.e., MGC) vector sequence is generated by an HMM parameter generation algorithm [52] given with the CDHMMs, the estimated state durations, and the contextual information (i.e., $I(s_{n-1}^{n+1})$, $F(s_{n-1}^{n+1})$, $p_n$, $q_n$, $r_n$, and $B_{n-1}^{n}$). It is noted that the energy level of each syllable CD-HMM (i.e., an Initial CD-HMM connecting with a Final CD-HMM) is scaled to $se'_n$ before executing the parameter generation algorithm so as to make the generated energy contour smooth and approximate the desired syllable energy levels.…”

Section: Speech Synthesis (mentioning)

confidence: 99%

A parametric prosody coding approach for Mandarin speech using a hierarchical prosodic model

Chiang 2018

J AUDIO SPEECH MUSIC PROC.

In this paper, a novel parametric prosody coding approach for Mandarin speech is proposed. It employs a hierarchical prosodic model (HPM) as a prosody-generating model in the encoder to analyze the speech prosody of the input utterance to obtain a parametric representation of four prosodic-acoustic features of syllable pitch contour, syllable duration, syllable energy level, and syllable-juncture pause duration for encoding. In the decoder, the four prosodic-acoustic features are reconstructed by a synthesis operation using the decoded HPM parameters. The reconstructed prosodic features are lastly used in an HMM-based speech synthesizer to generate the reconstructed speech. Objective and subjective evaluations showed that the proposed prosody coding approach encoded speech with better quality and lower data rate than the conventional segment-based coding scheme with vector or scalar quantization approach did. The reconstructed speech encoded by the proposed approach has good quality at low data rates of 81.4 and 72.7 bps for speaker-dependent and speaker-independent tasks, respectively. An application of the proposed prosody coding approach to speaking rate conversion by directly changing the HPM parameters to those of a different speaking rate is also illustrated. An informal listening test confirmed that both converted speeches of high and low speaking rate sounded very smooth.
