Interspeech 2021
DOI: 10.21437/interspeech.2021-184

Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated

Abstract: In this work, we address the prediction of speech articulators' temporal geometric position from the sequence of phonemes to be articulated. We start from a set of real-time MRI sequences uttered by a female French speaker. The contours of five articulators were tracked automatically in each of the frames in the MRI video. Then, we explore the capacity of a bidirectional GRU to correctly predict each articulator's shape and position given the sequence of phonemes and their duration. We propose a 5-fold cross-v…
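
The abstract states that the model receives the sequence of phonemes and their durations and predicts articulator contours for each MRI frame. One way such an input could be aligned with the video is sketched below, as an illustration only: the abstract does not describe the actual input encoding, so the function name `expand_to_frames`, the phoneme labels, and the frame rate are all assumptions.

```python
from typing import List, Tuple


def expand_to_frames(phonemes: List[Tuple[str, float]], frame_rate: float) -> List[str]:
    """Repeat each phoneme label once per video frame it spans.

    `phonemes` holds (label, duration in seconds) pairs; `frame_rate` is in
    frames per second. Both are hypothetical inputs, not values from the paper.
    """
    frames: List[str] = []
    elapsed = 0.0
    for label, duration in phonemes:
        elapsed += duration
        # Total number of frames that should exist once this phoneme has ended.
        # Rounding the cumulative time avoids accumulating per-phoneme rounding error.
        target = round(elapsed * frame_rate)
        frames.extend([label] * (target - len(frames)))
    return frames


if __name__ == "__main__":
    # Hypothetical utterance: two phonemes at an assumed 8 frames per second.
    print(expand_to_frames([("b", 0.5), ("a", 0.25)], frame_rate=8.0))
```

A frame-level label sequence of this kind is the sort of input a recurrent network such as a bidirectional GRU could consume, emitting one set of contour coordinates per frame.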

Cited by 3 publications (8 citation statements)
References: 14 publications
“…A probable explanation for this effect is that the ground truth is noisy once it is subjected to tracking errors. These tracking errors in the target curve impose a performance upper bound in the previous approach [8]. However, since we enforce phoneme-wise constraints in the reconstruction, the critical loss inputs prior domain knowledge to the model generating a potentially more realistic result than the ground truth, which explains why the ρTBCD and ρTTCD are slightly lower than in the previous work.…”
Section: Phoneme To Autoencoder's Components (mentioning)
confidence: 91%
“…The encoder-decoder network that maps phonemes to the autoencoder's latent space is very similar to the one used in [8]. The same GRU-based encoder with a linear reshaping layer is used.…”
Section: Phoneme To Autoencoder's Components (mentioning)
confidence: 99%
“…Along with this work, we extend the research to a larger dataset than that used in [23]. The dataset [24] is composed of one male French native speaker.…”
Section: Corpus (mentioning)
confidence: 94%
“…In our most recent work, we proposed the first attempt to predict the vocal tract shape from the phonemes to be articulated [23]. We proposed a deep neural network to predict the positions of five articulators, i.e., the tongue, the upper and lower lips, the soft palate, and the pharyngeal wall.…”
Section: Introduction (mentioning)
confidence: 99%