Recent improvements to the IBM trainable speech synthesis system

Eide, Ellen; Aaron, A.; Bakis, Raimo; Cohen, Reuven; Donovan, Robert E.; Hamza, Wael; Mathes, Timothy K.; Picheny, Michael; Polkosky, Melanie D.; Smith, M.; Viswanathan, Mahesh

doi:10.1109/icassp.2003.1198879

Cited by 33 publications

(24 citation statements)

References 6 publications

“…The interest of intonation models in TTS has lessened due to the use of synthesis techniques based on the speech-unit selection (Eide et al, 2003) or HMM synthesis (Tokuda et al, 2000). Nevertheless, the prediction of realistic target F0 contours is still useful for guiding the search of units in the corpus (Rodríguez and Campillo, 2006; Eide et al, 2003).…”

Section: Quality Of Synthetic Contours (mentioning)

confidence: 99%

“…Nevertheless, the prediction of realistic target F0 contours is still useful for guiding the search of units in the corpus (Rodríguez and Campillo, 2006; Eide et al, 2003). A list of dictionaries D The ordered list of dictionaries provides a way to build a graph of classes which conveys schematic visual information about the intonation patterns found in the corpus and their corresponding labels of prosodic features (see Appendix B for an explanation and section 3.5 for a more detailed discussion of the use of this graph in the experiments for Spanish language).…”

Section: Quality Of Synthetic Contours (mentioning)

confidence: 99%

“…As for the function-form correspondence, several proposals can be found in the state of the art on how to obtain the right relation between the acoustic parameters (representing the F0 contours) and the prosodic features: stochastic models (Veronis et al, 1998), neural networks (Holm, 2003; Sakurai et al, 2003), linear regression (Sproat and Olive, 1995) and decision and regression trees (Lee and Oh, 2001; Taylor, 2000; Eide et al, 2003). In all these methodologies, two main limitations arise: lack of robustness to cope with data scarcity training conditions and limited capabilities to provide experimentally contrastable prosodic information.…”

Section: Introduction (mentioning)

confidence: 99%

Applying data mining techniques to corpus based prosodic modeling

Mancebo

Cardeñoso-Payo

2007

Speech Communication

“…Corpus-based concatenative approach to speech synthesis has been widely explored in the research community in recent years [1,2,3]. Intonation modeling, or generation of fundamental frequency (F0) contour plays a crucial role in synthesizing natural sounding speech from input text.…”

Section: Introduction (mentioning)

confidence: 99%

“…Target F0 contour is generated using the features extracted from input text, and it is used either to modify the pitch of selected synthesis units, or in the unit selection where the discrepancies between target F0 contour and the F0 values of the synthesis units to be selected are attempted to be made as small as possible in the overall cost minimization through a search in the space of all available synthesis units. There has been a number of efforts in the context of F0 contour generation for English speech synthesis in the past decade, such as dynamical system [4], linear regression-based approach [5], combination of parametric models with regression trees [6,7], and the combination of regression trees and kernel smoother [2].…”

Section: Introduction (mentioning)

confidence: 99%

Additive Modeling of English F0 Contour for Speech Synthesis

Sakai

Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

In this paper, we present an approach to fundamental frequency contour modeling of English for speech synthesis, based on a statistical learning technique called Additive Models that was successfully applied to the modeling of Japanese F0 contour previously. In an attempt to model English F0 contour, we defined a three-layer additive model consisting of an intonational phrase component, a word-level component representing lexical stress types, and a pitch-accent component related to accented syllables. These component functions are estimated simultaneously using a backfitting algorithm derived from a regularized least-squares error criterion specified on the model with regard to the training data. The proposed method was trained and tested using the widely used ToBI-labeled speech corpus and promising results were obtained.

Unit Selection for Speech Synthesis Based on Acoustic Criteria

Rouibia

Rosec

Moudenc

2005

Text, Speech and Dialogue

This thesis relates to text-to-speech synthesis and deals more particularly with the corpus based approach. In the last few years, this approach based on the concatenation of acoustic segments contained in large databases has become increasingly popular. Indeed, selecting units which best fit the text to be synthesized leads to a synthesised signal whose naturalness can be rather well preserved. The quality of the synthesized speech obtained by corpus-based methods is closely related on the one hand to the corpus used for synthesis and on the other hand to the unit selection algorithm. In spite of the notable increase of quality reached with this technology, corpus-based speech synthesis is not able to guarantee a synthesised speech whose quality is constant on an entire utterance. This is mainly due to the lack of acoustic control of the existing corpus-based speech synthesis systems. The main objective of this thesis is therefore to introduce a mechanism allowing a better acoustical control during synthesis. The proposed method uses statistical approaches to generate a smooth acoustic target from which the sequence of synthesis units will be selected. This target is deduced from acoustic models, namely context dependent senone models, estimated during a training phase. Initially, we propose an algorithm of selection based only on this acoustic target. Then, the proposed selection method is modified so as to better control the information of fundamental frequency. This unit selection module is also combined with a pre-selection module so as to drastically reduce the computational load. Formal listening tests show that the proposed method leads to a significant reduction in acoustic discontinuities during the concatenation. The proposed method is also applied to acoustic database reduction and enables a compression of about 60% of the acoustic database without perceptible decrease of the speech quality. 
