2020
DOI: 10.1007/978-3-030-59716-0_45
Ultra2Speech - A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Cited by 10 publications (11 citation statements)
References 19 publications
“…Formally, our network has to map each MRI image to a spectral vector. However, using several consecutive input frames instead of a single frame can significantly improve the results [6], [32]. Hence, the input for all our network configurations was a 3D array, treating time as the third axis besides the two spatial axes of the images.…”
Section: 3D-CNN+BiLSTM
confidence: 99%
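The frame-stacking idea in the statement above, treating time as a third axis alongside the two spatial axes, can be sketched in plain Python. This is an illustrative toy, not the cited paper's code; the window length, frame size, and edge-clamping policy are all arbitrary assumptions here.

```python
# Illustrative sketch (not the authors' code): build a 3D input block by
# stacking consecutive 2D frames, with time as the leading axis alongside
# the two spatial axes of each image.

def stack_frames(frames, t, window=5):
    """Return `window` consecutive 2D frames centred on index `t` as a
    (window, H, W) nested list; indices are clamped at sequence edges."""
    n = len(frames)
    half = window // 2
    block = []
    for dt in range(-half, half + 1):
        idx = min(max(t + dt, 0), n - 1)  # clamp at sequence boundaries
        block.append(frames[idx])
    return block

# toy "video": 10 frames of 4x4 images, frame t filled with the value t
frames = [[[float(t)] * 4 for _ in range(4)] for t in range(10)]
block = stack_frames(frames, t=0, window=5)
# block has time as its leading axis: 5 frames, each 4x4
```

In a real pipeline each such block, rather than a single frame, would be the network input, which is what lets the model exploit temporal context.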
“…3D-CNN: Several authors argued recently that good video classification is also achievable using purely convolutional structures by extending the convolution to the temporal axis [6], [32], [36]. This is why a 3D convolutional model served as our second model (see Table I).…”
Section: 3D-CNN+BiLSTM
confidence: 99%
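The statement above describes extending convolution to the temporal axis. A minimal pure-Python sketch of that operation (an assumed illustration, not the cited models, which use learned multi-channel kernels) is:

```python
# Toy 3D convolution: the kernel slides over time as well as the two spatial
# axes, so temporal context enters the feature computation directly.

def conv3d(volume, kernel):
    """Valid-mode 3D convolution (cross-correlation, as in deep learning)
    of a (T, H, W) volume with a (kt, kh, kw) kernel; stride 1, no padding."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                s = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            s += volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out

# a 3x3x3 all-ones kernel over a 4x4x4 volume of ones: each output sums 27 ones
vol = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
ker = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
out = conv3d(vol, ker)
```

Stacking layers of this operation is what lets a purely convolutional model classify video without any recurrent component.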
“…While this simple arrangement already performs reasonably well, significant improvement can be achieved by involving the input context, that is, by using a block of video frames as input instead of just one image. Several network architectures have been proposed to process 3D blocks of input data, for video processing in general [24,25,26], and for ultrasound input in particular [12,14,27,16,28]. In the experimental section we will experiment both with 2D and 3D Convolutional Neural Networks (CNNs) for the mapping task.…”
Section: The UTI-to-Speech Framework
confidence: 99%
“…The goal is to convert this recording of the articulatory movement into a speech signal. Many possible approaches exist for this, but the most recent studies all apply deep neural networks (DNNs) for this task [23,4,21,7,27], and here we also apply neural structures that combine convolutional neural network (CNN) layers and recurrent layers such as the long short-term memory (LSTM) layer.…”
Section: Introduction
confidence: 99%
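The CNN+LSTM combination mentioned above pairs per-frame convolutional features with a recurrent layer that carries state across frames. The recurrent half can be sketched as a single-unit LSTM cell in plain Python; the weights below are arbitrary toy values, and real models use learned weight matrices over vector-valued features.

```python
import math

# Hedged sketch of the recurrent half of a CNN+LSTM pipeline: a one-unit
# LSTM cell. In the cited papers this would follow convolutional feature
# extraction from each ultrasound frame.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w=0.5, u=0.5, b=0.0):
    """One LSTM update for scalar input x, hidden state h, cell state c.
    The same toy weights are shared across all four gates for brevity."""
    i = sigmoid(w * x + u * h + b)    # input gate
    f = sigmoid(w * x + u * h + b)    # forget gate
    o = sigmoid(w * x + u * h + b)    # output gate
    g = math.tanh(w * x + u * h + b)  # candidate cell update
    c = f * c + i * g                 # blend old memory with new candidate
    h = o * math.tanh(c)              # expose a gated view of the memory
    return h, c

# run the cell over a short sequence of per-frame feature values
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = lstm_step(x, h, c)
# h now summarizes the sequence; a decoder would map it to spectral parameters
```

A bidirectional variant (BiLSTM) simply runs a second such cell over the sequence in reverse and concatenates the two hidden states per frame.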