2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018
DOI: 10.1109/icassp.2018.8461732
F0 Estimation for DNN-Based Ultrasound Silent Speech Interfaces

Cited by 21 publications (49 citation statements)
References 12 publications
“…The optimal parameters of the CNN architecture were calculated in an earlier hyperparameter optimization for the MGC-LSP target [33]. Note that using several consecutive images as input, or applying recurrent architectures, can lead to better results [10,11,33], but here we did not apply these in order to test scenarios which are more suitable for real-time implementation. The cost function applied for the log(F0) and MGC-LSP regression task was the mean-squared error (MSE), while for the V/UV classification we used cross-entropy.…”
Section: DNN Training With the Baseline Vocoder
confidence: 99%
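The quoted statement names two cost functions: mean-squared error for the log(F0) and MGC-LSP regression targets, and cross-entropy for the voiced/unvoiced (V/UV) decision. A minimal NumPy sketch of these two losses follows; the array names are illustrative and not taken from the cited implementation.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean-squared error, as used for the log(F0) and MGC-LSP regression targets
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(v_true, v_prob, eps=1e-12):
    # Cross-entropy for the binary V/UV classification target;
    # v_prob holds predicted voicing probabilities in (0, 1)
    v_prob = np.clip(v_prob, eps, 1.0 - eps)
    return -np.mean(v_true * np.log(v_prob) + (1.0 - v_true) * np.log(1.0 - v_prob))
```

In a multi-task setup such as the one described, the regression and classification losses would typically be combined (e.g. as a weighted sum) during DNN training.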
“…The main idea is to record the soundless articulatory movement and to automatically generate speech from the movement information, while the subject is not producing any sound. For this automatic conversion task, typically electromagnetic articulography (EMA) [2,3,4,5], ultrasound tongue imaging (UTI) [6,7,8,9,10,11,12,13], permanent magnetic articulography (PMA) [14,15], surface electromyography (sEMG) [16,17,18], Non-Audible Murmur (NAM) [19], or video of the lip movements [7,20] are used.…”
Section: Introduction
confidence: 99%