Additive Modeling of English F0 Contour for Speech Synthesis

Sakai, Shinsuke

doi:10.1109/icassp.2005.1415104

Cited by 15 publications

(23 citation statements)

References 18 publications

“…Prediction errors showed in table 1 indicate that the TTS results obtained using our clustering technique are comparable with other approaches found in the bibliography (see [14] for a ranking). Informal listening tests have been done to assess the goodness of the synthetic intonation.…”

Section: Results (supporting)

confidence: 64%

Mining Intonation Corpora Using Knowledge Driven Sequential Clustering

Mancebo

Cardeñoso-Payo

2006

Advances in Artificial Intelligence - IBERAMIA-SBIA 2006

Abstract. This work presents a mining methodology designed to cope with the usual data scarcity problems of intonation corpora, which arise from the high variability of prosodic information. The methodology is an adaptation of a basic agglomerative clustering technique, guided by a set of domain constraints. The peculiarities of the text-to-speech intonation modelling problem are considered in order to fix the initial configuration of the cluster and the criteria for merging classes and stopping their splitting. The scarcity problem poses the need to apply a sequential selection mechanism of prosodic features in order to obtain the initial set of classes in the cluster. A searching strategy to select the best class among a set of alternatives is proposed, which provides useful prediction models for accurate synthetic intonation. Visualization of final classes by means of a modified decision tree brings graphical cues about contrastable prosodic information of the intonation corpus.

“…Fundamental frequency (F0) is the most important acoustic correlate of tone in spoken Mandarin. There have been numerous studies on F0 modeling [1][2][3][4][5][6][7][8]. These studies are roughly around three issues.…”

Section: Introduction (mentioning)

confidence: 99%

Modeling and Generating Tone Contour with Phrase Intonation for Mandarin Chinese Speech

Soong

et al. 2008

2008 6th International Symposium on Chinese Spoken Language Processing

Abstract. This paper models F0 curves with discrete cosine transform (DCT) representations on both syllable-level tone and phrase-level intonation for Chinese Mandarin speech. Decision trees growing with maximum likelihood (ML) and stopping with minimum description length (MDL) are used to cluster very rich context-dependent DCT models into generalized ones to predict unseen contexts in test robustly. Additionally, we propose to generate Mandarin tone contours by jointly optimizing F0 contours of syllable and phrase in the ML sense. Experimental results on speaker-dependent continuous and speaker-independent isolated speech corpora show that the proposed approach can generate F0 contours with high correlation coefficients of 0.92 and 0.82 respectively, measured between the original and generated F0.
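
The DCT representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes an orthonormal DCT-II over a single contour and keeps only the first few coefficients; the function names (`dct_ii`, `inverse_dct`) and the F0 values are hypothetical, and the decision-tree clustering and joint syllable/phrase optimization described in the abstract are not implemented here.

```python
import math

def dct_ii(contour):
    """Orthonormal DCT-II of a list of F0 samples."""
    n = len(contour)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(v * math.cos(math.pi * (t + 0.5) * k / n)
                    for t, v in enumerate(contour))
        coeffs.append(scale * total)
    return coeffs

def inverse_dct(coeffs, n):
    """Rebuild n samples from the first len(coeffs) DCT coefficients.

    Passing fewer than n coefficients yields a smoothed contour.
    """
    contour = []
    for t in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            value += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        contour.append(value)
    return contour

# Hypothetical F0 samples (Hz) for one syllable; not taken from any corpus.
f0 = [210.0, 215.0, 222.0, 228.0, 225.0, 214.0, 200.0, 190.0]
coeffs = dct_ii(f0)
# Keep three coefficients as a compact, smooth description of the contour.
smooth = inverse_dct(coeffs[:3], len(f0))
```

Keeping only low-order coefficients gives a short, fixed-length description of a contour whatever its duration, which is one plausible reason such a representation suits context clustering.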

“…Furthermore, in this structure, each training data sample contributes to modeling multiple mean vectors and covariance matrices. Many papers applied the additive structure just for F0 modeling [37][38][39][40]. Authors in [37] proposed an additive structure with multiple decision trees for mean vectors and a single tree for variance terms.…”

Section: Introduction (mentioning)

confidence: 99%

“…Acoustic modeling with contextual additive structure has also been proposed to represent dependencies between contextual factors and acoustic features more precisely [19, 20, 23, 32, 36-40]. In this structure, acoustic trajectories are considered to be a sum of independent acoustic components which have different context dependencies (different decision trees have to be trained for those components).…”

Section: Introduction (mentioning)

confidence: 99%

“…In [40], multiple additive decision trees are also employed, but they train this structure using minimum generation error (MGE) criterion. Sakai [38] defines an additive model with three distinct layers, namely intonational phrase, word-level, and pitch-accent layers. All of these components were trained simultaneously using a regularized least square error criterion.…”

Section: Introduction (mentioning)

confidence: 99%
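
The additive structure described in these statements (three layers summed, all fitted at once under a regularized least-squares criterion) can be sketched roughly as follows. This is an illustration under strong assumptions, not the method of the cited paper: each layer is reduced to one scalar offset per context class, the regularizer is a plain ridge penalty, and the names `fit_additive`, `predict`, and `solve` are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - partial) / aug[r][r]
    return x

def fit_additive(samples, sizes, lam=0.1):
    """Jointly fit per-class offsets for three layers by ridge regression.

    samples: list of ((phrase_id, word_id, accent_id), f0) pairs (assumed format).
    sizes:   number of context classes in each of the three layers.
    Solves (X^T X + lam * I) w = X^T y, where X is the one-hot design matrix.
    """
    offsets = [0, sizes[0], sizes[0] + sizes[1]]
    dim = sum(sizes)
    gram = [[lam if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    xty = [0.0] * dim
    for ids, f0 in samples:
        active = [offsets[k] + ids[k] for k in range(3)]
        for i in active:
            xty[i] += f0
            for j in active:
                gram[i][j] += 1.0
    return solve(gram, xty), offsets

def predict(weights, offsets, ids):
    """Predicted F0 is the sum of the three layer contributions."""
    return sum(weights[offsets[k] + ids[k]] for k in range(3))

# Hypothetical training data: two classes per layer, F0 values in Hz.
data = [((p, w, a), [100.0, 120.0][p] + [0.0, 10.0][w] + [0.0, 20.0][a])
        for p in range(2) for w in range(2) for a in range(2)]
weights, offsets = fit_additive(data, sizes=[2, 2, 2], lam=1e-6)
estimate = predict(weights, offsets, (1, 0, 1))
```

Fitting all layers in one linear system, rather than one layer at a time, lets each training sample inform every component it touches; the ridge term keeps the system solvable even though the one-hot layers are not uniquely identifiable on their own.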

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis

Khorram

Sameti

Bahmaninezhad

et al. 2014

EURASIP Journal on Audio, Speech, and Music Processing

Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) The decision tree structure lacks adequate context generalization. (ii) It is unable to express complex context dependencies. (iii) Parameters generated from this structure represent sudden transitions between adjacent states. In order to alleviate the above limitations, many former papers applied multiple decision trees with an additive assumption over those trees. Similarly, the current study uses multiple decision trees as well, but instead of the additive assumption, it is proposed to train the smoothest distribution by maximizing entropy measure. Obviously, increasing the smoothness of the distribution improves the context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences). 
In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.
