Bayesian learning of speech duration models

Chien, Jen‐Tzung; Huang, Chih-Hsien

doi:10.1109/tsa.2003.818114

Cited by 24 publications

(4 citation statements)

References 25 publications

“…In the MCDC8, the duration of ordinary syllables ranges from 15 to 1110 ms (mean, 173 ms) and is equal to 5.78 syllables per second, which is faster than the articulation rate of news reporters in a Chinese Broadcast News corpus (Chien and Huang, 2003). We ran the variant selection algorithm on all disyllabic words in the MCDC8.…”

Section: A Disyllabic Words In Conversational Speech (mentioning)

confidence: 99%

Deriving disyllabic word variants from a Chinese conversational speech corpus

Liu

Tseng

Jang

2016

The Journal of the Acoustical Society of America

Motivated by the quasi-categorical reduced forms of disyllabic words produced in Chinese conversational speech, a frequency-based selection procedure of typical pronunciation by disyllabic word type and reduction degree is proposed in this paper. This variant-selection algorithm utilizes techniques of free phone recognition and phonetic similarity score calculation to account for Chinese syllable structure. Four reduction types are suggested by considering the presence of a within-word syllable boundary: Citation form-like reduction, marginal segment deletion, nuclei merger, and syllable merger. The results show that the most frequent reduction types for disyllabic words in Chinese conversation are citation form-like reduction and syllable merger. In particular, high-frequency disyllabic words preferentially take the extreme syllable-merger form. As shown in the analysis, segmental reduction in Chinese disyllabic words is morphology-dependent. It is also related to the prosodic position at which a disyllabic word is produced as well as the temporal quality of the word. Finally, in the automatic speech recognition experiments, the performance was improved by adding a small number of variants selected by the algorithm to the pronunciation dictionary of the system.


“…The phone duration modeling approaches are divided in two major categories: The rule-based (Klatt, 1979) and the data-driven methods (Mobius and Santen, 1996; Santen, 1992; Chen et al, 1998; Chien and Huang, 2003; Lazaridis et al, 2007). In the rule-based methods manually produced rules, extracted from experimental studies on large sets of utterances or based on previous knowledge, are utilized for determining the duration of segments.…”

Section: Introduction (mentioning)

confidence: 99%

“…Over the last years various statistical methods have been applied in the phone duration modeling task such as, Linear Regression (LR) (Takeda et al, 1989), decisions tree-based models (Mobius and Santen, 1996), Sums-Of-Products (SOP) (Santen, 1992). Artificial Neural Networks (ANN) techniques (Chen et al, 1998), Bayesian models (Chien and Huang, 2003) and instance-based algorithms (Lazaridis et al, 2007) have also been introduced on the phone duration modeling task. Consequently the data-driven approaches offer us the ability to overcome the time consuming labor of the manual extraction of the rules which are needed in the rule-based approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Comparative Evaluation of Phone Duration Models for Greek Emotional Speech

Lazaridis

2010

Journal of Computer Science

Problem statement: In this study we cope with the task of phone duration modeling for Greek emotional speech synthesis. Approach: Various well established machine learning techniques are applied for this purpose to an emotional speech database consisting of five archetypal emotions. The constructed phone duration prediction models are built on phonetic, morphosyntactic and prosodic features that can be extracted only from text. We employ model and regression trees, linear regression, lazy learning algorithms and meta-learning algorithms using regression trees as base classifiers, trained on a Modern Greek emotional database consisting of five emotional categories: anger, fear, joy, neutral and sadness. Results: Model trees based on the M5' algorithm and meta-learning algorithms using as base classifier regression trees based on the M5' algorithm proved to perform better. Conclusion: It was observed that the emotional categories of the speech database with the most uniform distribution of phone durations built the most accurate models.

“…Alternatively, besides following the formal transition probability estimation for HMM, the lack of distinct duration modelling for non-stationary SRs may be addressed by SR dependent HMM [Anastasakos et al., ] [Zheng et al, 2003] or the estimation of transition dependent probability distributions modelling discrete duration length [Chien & Huang, 2003].…”

Section: Initial Acoustic Model Estimation (mentioning)

confidence: 99%

Large vocabulary continuous speech recognition for the transcription of Catalan broadcast news and conversations : towards analysis and modelling of acoustic reduction in spontaneous speech

Schulz

The transcription of spontaneous speech still poses a challenge to state-of-the-art methods for automatic speech recognition. The present thesis describes the comprehensive development of a large vocabulary continuous speech recognition system for the transcription of Catalan broadcast news and conversations and evolves towards novel approaches for analysis and modelling of acoustic reduction in spontaneous speech. It emphasises initially on various conventional methods for acoustic analysis, acoustic and language modelling and hypothesis search. Improvements over the original single-pass baseline system are mainly attained by domain and speaking style emphasising interpolation of individually estimated language models, linear discriminating projection of acoustic observations that improves the phonetic class separability, speaker normalisation of the acoustic observations, speaker adaptive training and acoustic model adaptation in a multi-pass system approach. The analysis of acoustic reduction initially emphasises on context independent vowel and consonant specific spectral and temporal properties whose parameters display statistically significant differences between the phoneme prototypes in spontaneous speech and their canonical realisations in planned speech. The introduction of the feature space analysis provides the general means to reveal these differences in conventional acoustic observations for automatic speech recognition. It displays statistically significant differences context-independently but also in a syllable context between adjacent phonemes suggesting particular reduction patterns. The analysis furthermore challenges the often suggested coherence between the co-occurring reduction of spectral and temporal properties. The modelling of acoustic reduction first emphasises on segment conditioned discriminating variables and variability class dependent models and variability class specific adaptation of the original acoustic model.
It introduces phoneme rate as means to analyse temporal properties and feature space reduction ratio as means to analyse the reduction of spectral properties in conventional feature space for large vocabulary continuous speech recognition as discriminating variables. These variables are clustered and determine the classes for segment conditioned variability class dependent models and their scoring during the hypothesis search in recognition. Both approaches display no significant performance improvement. Furthermore the modelling advances towards segment constituent predictability dependent models that introduce predictability as discriminating variable for variability class dependent models relying on the fundamental coherence between predictability and acoustic reduction that is suggested through the principle of least effort and the redundancy theory. It thereby emphasises on word and phoneme predictability. This approach displays no significant performance improvement. Planned speech is apparently antagonising the principle of least effort. Thus, a prior segment conditioned analysis of acoustic reduction may indicate its average degree of reduction, while their within-segment variation may indicate whether it exhibits sufficient relaxation of the speaking style to adopt the principle of least effort. Thus, segments exhibiting small within-segment variation may be modelled separately from those of large within-segment variation, whereas modelling the latter by word, syllable or phoneme predictability dependent models may provide a research perspective.
