Integrating Articulatory Features Into HMM-Based Parametric Speech Synthesis

Ling, Zhen-Hua; Richmond, Korin; Yamagishi, Junichi; Wang, Ren-Hua

doi:10.1109/tasl.2009.2014796

Cited by 90 publications

(119 citation statements)

References 26 publications

Supporting

Mentioning

115

Contrasting


“…The stimulus words were randomly presented to the listeners, who were asked to first identify the Thai word they heard and then select a naturalness score on a five-level scale from terrible (1) to excellent (5). Listeners were allowed to listen to the stimuli as many times as they preferred.…”

Section: Numerical Assessment and Perceptual Evaluation (mentioning)

confidence: 99%


Identifying underlying articulatory targets of Thai vowels from acoustic data based on an analysis-by-synthesis approach

Prom-on

Birkholz

2014

J AUDIO SPEECH MUSIC PROC.


This paper investigates the estimation of underlying articulatory targets of Thai vowels as invariant representations of vocal tract shapes by means of analysis-by-synthesis based on acoustic data. The basic idea is to simulate the process of learning speech production as a distal learning task, with acoustic signals of natural utterances in the form of Mel-frequency cepstral coefficients (MFCCs) as input, VocalTractLab, a 3D articulatory synthesizer controlled by target approximation models, as the learner, and stochastic gradient descent as the target training method. To test the effectiveness of this approach, a speech corpus was designed to contain contextual variations of Thai vowels by juxtaposing nine Thai long vowels in two-syllable sequences. The corpus, consisting of 81 disyllabic utterances, was recorded from a native Thai speaker. Nine vocal tract shapes, each corresponding to a vowel, were estimated by optimizing the vocal tract shape parameters of each vowel to minimize the sum of squared errors of MFCCs between original and synthesized speech. The stochastic gradient descent algorithm was used to iteratively optimize the shape parameters. The optimized vocal tract shapes were then used to synthesize Thai vowels both in monosyllables and in disyllabic sequences. The results, both numerical and perceptual, indicate that this model-based analysis strategy makes it possible to estimate vocal tract shapes effectively and economically, and to synthesize accurate Thai vowels as well as smooth formant transitions between adjacent vowels.
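The optimization loop described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation and does not call VocalTractLab: the `synthesize` function is a hypothetical linear stand-in for "articulatory synthesis followed by MFCC extraction", and the matrix `A`, offset `b`, learning rate, and epoch count are all assumptions chosen only to make the example runnable. What it shows is the general pattern: shape parameters are updated by stochastic gradient descent to reduce the squared error between synthesized and observed feature frames.

```python
import numpy as np

def synthesize(params, A, b):
    """Hypothetical stand-in for synthesizer + MFCC extraction (linear toy model)."""
    return A @ params + b

def sse(params, frames, A, b):
    """Sum of squared errors between synthesized features and all observed frames."""
    return float(sum(np.sum((synthesize(params, A, b) - f) ** 2) for f in frames))

def fit_shape(frames, A, b, n_params, lr=0.01, epochs=200, seed=0):
    """Estimate shape parameters by stochastic gradient descent.

    Each update uses a single observed frame, visited in random order.
    The gradient below is exact only for the linear toy model above.
    """
    rng = np.random.default_rng(seed)
    params = np.zeros(n_params)
    for _ in range(epochs):
        for i in rng.permutation(len(frames)):
            residual = synthesize(params, A, b) - frames[i]
            grad = 2.0 * A.T @ residual  # d/dparams of ||A p + b - y||^2
            params -= lr * grad
    return params

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(13, 5))   # 13 MFCC-like features, 5 shape parameters (assumed sizes)
    b = rng.normal(size=13)
    true_params = rng.normal(size=5)
    frames = [A @ true_params + b + 0.01 * rng.normal(size=13) for _ in range(20)]
    estimate = fit_shape(frames, A, b, n_params=5)
    print("error before:", sse(np.zeros(5), frames, A, b))
    print("error after: ", sse(estimate, frames, A, b))
```

With a real articulatory synthesizer the mapping from parameters to MFCCs is nonlinear and has no closed-form gradient, so the gradient would have to be approximated (for example by finite differences); the sketch sidesteps that by using a linear model.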



“…Understanding how proper articulatory skills can be learned from acoustic data, a task known as acoustic-to-articulatory inversion, is therefore the key to our understanding of the nature of human speech acquisition and production. Such knowledge is also beneficial to both speech recognition [4] and speech synthesis [5].…”

Section: Introduction (mentioning)

confidence: 99%

Identifying underlying articulatory targets of Thai vowels from acoustic data based on an analysis-by-synthesis approach

Prom-on

Birkholz

2014

J AUDIO SPEECH MUSIC PROC.


“…Recovering the vocal tract shape from speech acoustics could benefit many automatic speech processing systems, for instance by enriching the acoustic information for synthesis [1] and recognition [2]. In fact, articulatory features are more robust than acoustic features, as articulatory features vary very slowly when compared with speech acoustic features.…”

Section: Introduction (mentioning)

confidence: 99%

Phoneme-to-Articulatory Mapping Using Bidirectional Gated RNN

Biasutto-Lervat

Ouni

2018

Interspeech 2018


Deriving articulatory dynamics from the acoustic speech signal has been addressed in several speech production studies. In this paper, we investigate whether it is possible to predict articulatory dynamics from phonetic information without having the acoustic speech signal. The input data may be considered not sufficiently rich acoustically, as there is probably no explicit coarticulation information, but we expect that the phonetic sequence provides compact yet rich knowledge. Motivated by the recent success of deep learning techniques in acoustic-to-articulatory inversion, we have experimented with bidirectional gated recurrent neural network architectures. We trained these models on an EMA corpus and obtained good performance, similar to state-of-the-art articulatory inversion from LSF features, while using only the phoneme labels and durations.


“…Articulatory movement data obtained using an EMA enjoy wide use in the fields of speech science and technologies, such as the analysis of coarticulation [3], speech therapy [4], estimation of articulatory movement from speech [5], and speech synthesis [6].…”

Section: Introduction (mentioning)

confidence: 99%