Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated

Ribeiro, Vinícius Campos Tinoco; Isaieva, Karyna; Leclère, Justine; Vuissoz, Pierre‐André; Laprie, Yves

doi:10.21437/interspeech.2021-184

Cited by 3 publications

(8 citation statements)

References 14 publications

“…A probable explanation for this effect is that the ground truth is noisy once it is subjected to tracking errors. These tracking errors in the target curve impose a performance upper bound in the previous approach [8]. However, since we enforce phoneme-wise constraints in the reconstruction, the critical loss inputs prior domain knowledge to the model generating a potentially more realistic result than the ground truth, which explains why the ρTBCD and ρTTCD are slightly lower than in the previous work.…”

Section: Phoneme To Autoencoder's Components (mentioning)

confidence: 91%

“…The encoder-decoder network that maps phonemes to the autoencoder's latent space is very similar to the one used in [8]. The same GRU-based encoder with a linear reshaping layer is used.…”

Section: Phoneme To Autoencoder's Components (mentioning)

confidence: 99%

“…Previous work [8,9] proposed an encoder-decoder neural network to predict the vocal tract shape for a sequence of phonemes to be articulated. The former was the first to provide the complete vocal tract shape, including all speech articulators from the glottis to the lips.…”

Section: Introduction (mentioning)

confidence: 99%

Autoencoder-Based Tongue Shape Estimation During Continuous Speech

Ribeiro, Laprie

2022

Interspeech 2022

Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods lack adequacy to many physical constraints related to speech production. This study proposes an alternative approach to the task to solve specific issues faced in the previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data's encoding and serves as an auxiliary network for the principal one, which maps phonemes to the shapes. Instead of predicting the exact points in the target curve, the neural network learns how to predict the curve's main components, i.e., the autoencoder's representation. We show how this approach allows imposing critical articulators' constraints, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.

Autoencoder-Based Tongue Shape Estimation During Continuous Speech

Ribeiro, Laprie

2022

Interspeech 2022

“…Along with this work, we extend the research to a larger dataset than that used in [23]. The dataset [24] is composed of one male French native speaker.…”

Section: Corpus (mentioning)

confidence: 94%

“…In our most recent work, we proposed the first attempt to predict the vocal tract shape from the phonemes to be articulated [23]. We proposed a deep neural network to predict the positions of five articulators, i.e., the tongue, the upper and lower lips, the soft palate, and the pharyngeal wall.…”

Section: Introduction (mentioning)

confidence: 99%