2003
DOI: 10.1007/3-540-45011-4_6

Evaluation of a Segmental Durations Model for TTS

Abstract: In this paper we present a condensed description of a European Portuguese segmental duration model for TTS purposes and concentrate on its evaluation. The model is based on artificial neural networks. Model quality was evaluated by comparison with read speech. The standard deviation reached on the test set is 19.5 ms and the linear correlation coefficient is 0.84. In the perceptual evaluation the model scored 4.12, against 4.30 for natural human read speech, on a 5-point scale.
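The two objective figures quoted in the abstract can be illustrated with a minimal sketch. It assumes that the "standard deviation" is the standard deviation of the prediction error (predicted minus measured duration) and that the correlation is Pearson's r between predicted and measured durations; the abstract does not spell out either definition. The function names `error_std` and `pearson_r` are hypothetical and are not taken from the paper.

```python
import math


def error_std(predicted_ms, measured_ms):
    """Population standard deviation of the prediction error, in ms.

    Assumption: error = predicted - measured, one value per segment.
    """
    errors = [p - m for p, m in zip(predicted_ms, measured_ms)]
    mean_err = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_err) ** 2 for e in errors) / len(errors))


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Called on the predicted and measured durations of a test set, these two functions would yield figures of the same kind as the 19.5 ms and 0.84 reported above, provided the assumed definitions match the ones used by the authors.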

citations
Cited by 3 publications
(3 citation statements)
references
References 7 publications
“…Because of its marked effect on the naturalness of synthesis, this is precisely one of the aspects that has received the most attention in recent years. Prosodic models designed for text-to-speech conversion generally focus on aspects related to intonation: the delimitation of prosodic units, or phrasing (Rodríguez (Córdoba, Montero, Gutiérrez Arriola, Vallejo, Enríquez & Pardo 2002, Córdoba, Vallejo, Montero, Gutiérrez Arriola, López Carmona & Pardo 1999, Meza, Kirschning & Cervantes 2000 and Santos, Muñoz & Martínez Martín 1988, in Catalan (Febrer, Padrell & Bonafonte 1998), in Galician (Fernández Salgado & Rodríguez Banga 1999) or in Portuguese (Teixeira & Freitas 2003) or the assignment of pauses (Adell, Bonafonte & Escudero 2007, Agüero & Bonafonte 2003, Barbosa 1997 and Puigví, Jiménez & Fernández 1994, although they occasionally also consider less-studied parameters such as intensity (Blecua & Acín 1995) and global phenomena such as rhythm.…”
Section: La Conversión De Texto En Habla [Text-to-Speech Conversion] (unclassified)
“…TD-PSOLA algorithms (Charpentier & Moulines, 1990) allow F0 and durations to be modified within certain limits. Namely, it is not recommended to raise or lower the F0 and/or durations by a factor of 2 relative to the original, owing to the severe loss in speech quality (Teixeira, 2012) (Barros, 2002).…”
Section: Introduction (mentioning)
confidence: 99%
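The factor-of-2 guideline in the statement above can be written as a simple range check. This is only a sketch of one reading of that guideline: it assumes the limit applies to the ratio between target and original value, and that a ratio of exactly 2 (or 0.5) already falls outside the recommended range. The function name `within_psola_limits` and the `max_factor` parameter are hypothetical.

```python
def within_psola_limits(original, target, max_factor=2.0):
    """Return True if the target stays inside the recommended modification range.

    Assumption: the guideline is interpreted as
    1/max_factor < target/original < max_factor, with strict bounds.
    Works for either F0 (Hz) or duration (ms), as long as both values
    use the same unit.
    """
    if original <= 0 or target <= 0:
        raise ValueError("original and target must be positive")
    ratio = target / original
    return 1.0 / max_factor < ratio < max_factor
```

For example, stretching a 100 ms segment to 150 ms passes the check, while stretching it to 200 ms or compressing it to 40 ms does not.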
“…Depending on the prosody concept used to handle intonation (F0 curves), different F0 patterns will be produced. The most common F0 modulation concepts are: the Fujisaki F0 model (Teixeira, 2012), a physiological model developed by (Fujisaki, 1983) that divides the F0 into three components added in the logarithmic domain; the Tone and Break Indices (ToBI), widely used to represent intonation and prosodic structure and based on a set of absolute and relative tone marks and break indices (Pierrehumbert, 1980); the Tilt model (Taylor, 2000), which represents intonation as a linear sequence of events, which may be F0 accents or boundary tones; and INTSINT, proposed by (Hirst & Di Cristo, 1998), an intonation transcription system that codes F0 patterns using a set of abstract tone symbols. Concerning the length of the phoneme sounds, the most classical models are: statistical models, which apply generic tools such as Classification and Regression Trees (CART) or Artificial Neural Networks (ANN), as documented by Teixeira and Freitas in (Teixeira & Freitas, 2005), which uses several features based on the context, phoneme class, and intonation groups as input to a set of dedicated ANNs that produce the duration of the current phoneme sound; rule-based models such as the (Keller & Zellner, 1997) model, which applies more or less complex rules to lengthen or shorten the segments; mathematical models such as the (Klatt, 1976) or (Van Santen, 1997) models, which combine multiple features into a single expression, usually a sum of products, that establishes the duration of the segment; and finally models that combine several of these approaches, such as the (Campbell, 2000) and (Barbosa & Bailly, 1994) models, which combine neural networks and mathematical models.…”
Section: Introduction (mentioning)
confidence: 99%
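The "sum-of-products" form mentioned in the statement above can be sketched in a few lines. This is a generic illustration of the arithmetic only, not the Klatt or Van Santen model: the choice of factors, their values, and the function name `sum_of_products_duration` are all hypothetical.

```python
import math


def sum_of_products_duration(terms):
    """Segment duration as a sum of products of factor values.

    `terms` is a list of lists; each inner list holds the factor values
    (for example a base duration in ms and context-dependent scaling
    factors) that are multiplied together, and the products are summed.
    All values here are illustrative assumptions.
    """
    return sum(math.prod(factors) for factors in terms)


# Hypothetical example: an 80 ms base duration scaled by a stress factor
# of 1.5, plus a 10 ms term scaled by a phrase-position factor of 0.5.
duration_ms = sum_of_products_duration([[80.0, 1.5], [10.0, 0.5]])
```

In a real model of this family the factor values would be estimated from a labelled speech corpus rather than set by hand.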