Pitch pattern generation using multispace probability distribution HMM

Masuko, Takashi; Tokuda, Keiichi; Miyazaki, Noboru; Kobayashi, Takao

doi:10.1002/scj.1133

Cited by 13 publications

(19 citation statements)

References 12 publications

“…In particular, the domains of the dynamic F0 features (normally the 1st and 2nd order derivatives of the static F0 observations, referred to as delta and delta-delta features, respectively) are also discontinuous. Hence, for frames at the boundaries between voiced and unvoiced regions, they can not be directly calculated and are therefore defined as NULL in the most widely used implementation of MSDHMM, i.e., these frames are regarded as unvoiced as far as the dynamic features are concerned [11]. This means that near a boundary, the static F0 feature can be a real value whilst the delta and delta-delta features are NULL.…”

Section: Discontinuous F0 Modelling (mentioning)

confidence: 99%
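The boundary behaviour described in the statement above can be illustrated with a minimal Python sketch. It is not taken from the cited paper or from any MSDHMM toolkit: the function name `dynamic_features`, the use of `None` for NULL, and the three-frame window coefficients (-0.5, 0, 0.5 for delta and 1, -2, 1 for delta-delta) are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

F0Track = List[Optional[float]]  # None stands for an unvoiced (NULL) frame


def dynamic_features(f0: F0Track) -> Tuple[F0Track, F0Track]:
    """Return (delta, delta_delta); None marks an undefined (NULL) value.

    Assumed windows: delta = 0.5 * (x[t+1] - x[t-1]),
    delta-delta = x[t-1] - 2 * x[t] + x[t+1].
    """
    n = len(f0)
    delta: F0Track = [None] * n
    delta2: F0Track = [None] * n
    for t in range(1, n - 1):
        prev, cur, nxt = f0[t - 1], f0[t], f0[t + 1]
        # The whole window must lie inside a voiced region; otherwise the
        # dynamic features stay NULL even if the static value is real.
        if prev is None or cur is None or nxt is None:
            continue
        delta[t] = 0.5 * (nxt - prev)
        delta2[t] = prev - 2.0 * cur + nxt
    return delta, delta2


# Toy track: unvoiced, three voiced frames, unvoiced.
f0 = [None, 100.0, 110.0, 130.0, None]
delta, delta2 = dynamic_features(f0)
print(delta)   # [None, None, 15.0, None, None]
print(delta2)  # [None, None, 10.0, None, None]
```

In this toy track, frames 1 and 3 carry a real static F0 value but NULL dynamic features, which is the situation the statement describes for frames next to a voiced/unvoiced boundary.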

“…(2). Hence, the state output distribution of the full F0 observation is a product of the output distributions of the static and dynamic streams [11].…”

Section: Discontinuous F0 Modelling (mentioning)

confidence: 99%
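The stream product mentioned in the statement above can be sketched as follows. This is a hedged illustration, not the formulation of the cited paper: the class `MSDStream`, the function `state_output`, the single-Gaussian voiced space, and all parameter values are hypothetical. Each stream is assumed to have a zero-dimensional unvoiced space and a one-dimensional voiced space, with `None` standing for a NULL observation.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MSDStream:
    """One F0 stream with an unvoiced space and a 1-D Gaussian voiced space."""
    voiced_weight: float  # weight of the voiced space; unvoiced gets the rest
    mean: float
    var: float

    def likelihood(self, x: Optional[float]) -> float:
        if x is None:
            # NULL observation: only the unvoiced space weight contributes.
            return 1.0 - self.voiced_weight
        gauss = math.exp(-0.5 * (x - self.mean) ** 2 / self.var)
        gauss /= math.sqrt(2.0 * math.pi * self.var)
        return self.voiced_weight * gauss


def state_output(streams: Sequence[MSDStream],
                 obs: Sequence[Optional[float]]) -> float:
    """Product of the per-stream output likelihoods for one frame."""
    p = 1.0
    for stream, x in zip(streams, obs):
        p *= stream.likelihood(x)
    return p


# Hypothetical static, delta and delta-delta streams for one state.
streams = [
    MSDStream(voiced_weight=0.8, mean=5.0, var=1.0),
    MSDStream(voiced_weight=0.8, mean=0.0, var=1.0),
    MSDStream(voiced_weight=0.8, mean=0.0, var=1.0),
]

# Boundary frame: real static value, NULL delta and delta-delta.
p_boundary = state_output(streams, [5.0, None, None])
# Fully unvoiced frame: every stream observes NULL.
p_unvoiced = state_output(streams, [None, None, None])
```

Note that every stream carries its own voicing weight in this sketch, so the voicing probability is parameterised three times for a single state.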

“…Due to the discontinuity at the boundary between voiced and unvoiced regions, dynamic features can not be easily calculated. Hence, in the most widely used MSDHMM implementation, separate streams are normally used to model static and dynamic features [11]. This results in redundant voicing probability parameters which may not only limit the number of clustered states, but also weaken the correlation modelling between static and dynamic features.…”

Section: Introduction (mentioning)

confidence: 99%

Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis

Young

2011

IEEE Trans. Audio Speech Lang. Process.

102

Abstract: The modelling of fundamental frequency, or F0, in HMM-based speech synthesis is a critical factor in delivering speech which is both natural and accurately conveys all of the many nuances of the message. However, F0 modelling is difficult because F0 values are normally considered to depend on a binary voicing decision such that they are continuous in voiced regions and undefined in unvoiced regions. F0 is therefore a discontinuous function of time. The multi-space probability distribution HMM (MSDHMM) is a widely used solution to this problem. The MSDHMM essentially uses a joint distribution of discrete voicing labels and the discontinuous F0 observations. However, due to the discontinuity assumption, the MSDHMM provides a rather weak F0 trajectory model. In this paper, F0 is viewed as being a continuous function of time and this is achieved by assuming that F0 can be observed within unvoiced regions as well as voiced regions. This provides a continuous F0 data stream which can be modelled by standard HMMs. Voicing labels are modelled either implicitly or explicitly in order to perform voicing classification and a globally tied distribution (GTD) technique is used to achieve robust F0 estimation. Both objective measures and subjective listening tests demonstrate that continuous F0 modelling yields better synthesised F0 trajectories and significant improvements to the naturalness of synthesised speech compared to using the MSDHMM model.

“…Though there are some exceptions [8,4], the most widely used method is to model static and dynamic features in separate streams [9]. This common implementation limits the power of HMMs to model the F0 trajectory.…”

Section: Comparison Of F0 Modelling Approaches For HMM Based Speech S (mentioning)

confidence: 99%

Joint modelling of voicing label and continuous F0 for HMM based speech synthesis

Young

2011

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Fundamental frequency, or F0, is critical for high quality speech synthesis in HMM based speech synthesis. Traditionally, F0 values are considered to depend on a binary voicing decision such that they are continuous in voiced regions and undefined in unvoiced regions. Multi-space distribution HMM (MSDHMM) has been used for modelling the discontinuous F0. Recently, a continuous F0 modelling framework has been proposed and shown to be effective, where continuous F0 observations are assumed to always exist and voicing labels are explicitly modelled by an independent stream. In this paper, a refined continuous F0 modelling approach is proposed. Here, F0 values are assumed to be dependent on voicing labels and both are jointly modelled in a single stream. Due to the enforced dependency, the new method can effectively reduce the voicing classification error. Subjective listening tests also demonstrate that the new approach can yield significant improvements in the naturalness of the synthesised speech. A dynamic random unvoiced F0 generation method is also investigated. Experiments show that it has a significant effect on the quality of synthesised speech.

“…As a result, subsequent F0 modeling and generation suffer. Standard HMM-based TTS [2] uses multi-space distribution (MSD) to model and generate discontinuous F0 trajectories [18]. Faulty voicing decisions resulting from the F0 extraction phase will cause the deteriorately trained MSD-HMMs to synthesize voiced frames as unvoiced, resulting in hoarse speech, or to synthesize unvoiced frames as voiced, resulting in buzzy speech [19].…”

Section: Introduction (mentioning)

confidence: 99%

F0 Parameterization of Glottalized Tones in HMM-Based Speech Synthesis for Hanoi Vietnamese

Ninh

Yamashita

2015

IEICE Trans. Inf. & Syst.

SUMMARY: A conventional HMM-based speech synthesis system for Hanoi Vietnamese often suffers from hoarse quality due to incomplete F0 parameterization of glottalized tones. Since estimating F0 from glottalized waveform is rather problematic for usual F0 extractors, we propose a pitch marking algorithm where pitch marks are propagated from regular regions of a speech signal to glottalized ones, from which complete F0 contours for the glottalized tones are derived. The proposed F0 parameterization scheme was confirmed to significantly reduce the hoarseness whilst slightly improving the tone naturalness of synthetic speech by both objective and listening tests. The pitch marking algorithm works as a refinement step based on the results of an F0 extractor. Therefore, the proposed scheme can be combined with any F0 extractor.
