Interspeech 2021
DOI: 10.21437/interspeech.2021-1375
Estimating Articulatory Movements in Speech Production with Transformer Networks

Cited by 11 publications (8 citation statements) | References 0 publications
“…Advancements in deep neural networks (DNNs), especially in processing time-series data to capture contextual information, have propelled the development of SI systems to new heights. Bidirectional LSTMs (BiLSTMs) [14,15], CNN-BiLSTMs [16,17], Temporal Convolutional Networks (TCNs) [18] and transformer models [19] have achieved state-of-the-art results on multiple articulatory datasets [20]. In our previous work, we reported state-of-the-art SI results with the XRMB dataset [13], but with a rather simple feed-forward neural network, using manually contextualized MFCCs as input features and performing speaker adaptation with vocal tract length normalization [13].…”
Section: Introduction (mentioning)
Confidence: 99%
“…Over the past few years, deep neural network (DNN) based models have propelled the development of SI systems to new heights. Bidirectional LSTMs (BiLSTMs) [15,16], CNN-BiLSTMs [17,18], Temporal Convolutional Networks (TCNs) [19] and transformer models [20] have achieved state-of-the-art results on multiple articulatory datasets [21]. To further improve the speech inversion task, researchers have tried incorporating phonetic transcriptions as an input alongside acoustic features [22,17].…”
Section: Introduction (mentioning)
Confidence: 99%
“…Recent advancements in deep neural networks (DNNs) and learning algorithms have significantly improved SI systems over the last few years. Bidirectional LSTMs (BiLSTMs) [14], BiGRNNs [15], CNN-BiLSTMs [16], Temporal Convolutional Networks (TCNs) [17] and transformer models [18] have achieved state-of-the-art results on multiple articulatory datasets. Most of these SI systems use either extracted acoustic features like Mel Frequency Cepstral Coefficients (MFCCs) or Mel-spectrograms, or the waveform itself, as the input speech representation, and learn a mapping to the ground-truth articulatory variables.…”
Section: Introduction (mentioning)
Confidence: 99%
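The speech-inversion pipeline the excerpts describe (acoustic features in, articulatory trajectories out) can be sketched minimally. The sketch below is an illustrative assumption, not the architecture of any cited system: all shapes, the context width, and the untrained two-layer regressor are made up for demonstration, and only the frame-stacking step loosely mirrors the "manually contextualized MFCCs" mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 100 frames of 13-dim MFCCs mapped to 6 articulatory variables.
n_frames, n_mfcc, n_artic, context = 100, 13, 6, 3  # 3 context frames per side

def contextualize(mfcc, c):
    """Stack +/- c neighbouring frames onto each frame (edge-padded)."""
    padded = np.pad(mfcc, ((c, c), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + len(mfcc)] for i in range(2 * c + 1)])

mfcc = rng.standard_normal((n_frames, n_mfcc))
x = contextualize(mfcc, context)            # (100, 13 * 7) = (100, 91)

# Toy two-layer feed-forward regressor (random weights, untrained):
# frame-level mapping from contextualized MFCCs to articulatory trajectories.
w1 = rng.standard_normal((x.shape[1], 64)) * 0.1
w2 = rng.standard_normal((64, n_artic)) * 0.1
pred = np.tanh(x @ w1) @ w2                 # (100, 6) predicted trajectories
```

In a real SI system the regressor would be trained against ground-truth articulatory recordings, and the sequence models cited above (BiLSTMs, TCNs, transformers) replace the manual frame stacking by learning temporal context directly.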