Interspeech 2019
DOI: 10.21437/interspeech.2019-2046

Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder

Abstract: Recently it was shown that, within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher-quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which do…
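
To illustrate the continuous-pitch idea from the abstract, the sketch below estimates F0 and then interpolates through unvoiced frames so the contour takes non-zero values everywhere. This is a minimal illustration, not the paper's tracker: the file name is a placeholder, and librosa's pYIN estimator is swapped in for whatever F0 tracker the authors actually used.

```python
# Minimal sketch of a continuous F0 contour (illustration only, not the
# paper's tracker): estimate F0 with pYIN, then interpolate through the
# unvoiced frames so the contour stays non-zero where voicing is absent.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder input file
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

frames = np.arange(len(f0))
voiced = ~np.isnan(f0)  # pyin marks unvoiced frames with NaN
# Linear interpolation across unvoiced gaps yields a continuous contour,
# analogous to the continuous F0 target used with a continuous vocoder.
f0_continuous = np.interp(frames, frames[voiced], f0[voiced])
```

Note that np.interp also extends the first and last voiced values over any unvoiced edges, so the resulting contour is defined for every frame.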

Cited by: 12 publications (23 citation statements)
References: 30 publications
“…Articulation is the ability to physically move the tongue, lips, teeth, and jaw to produce the sequence of speech sounds that make up words and sentences. Articulation-to-speech (ATS) synthesis is therefore being developed to restore the original voices of patients who produce abnormal speech, for example due to speech disorders [101, 102, 103, 104]. Because this technique maps articulatory information directly to speech [104], it can generate voice from articulatory movement data without the subject making any sound [103]…”
Section: Deep Learning Based Voice Recognition
confidence: 99%
“…So articulation-to-speech (ATS) synthesis is being developed to restore the original voices of patients who produce abnormal speech, for example due to speech disorders [101, 102, 103, 104]. Because this technique maps articulatory information directly to speech [104], it can generate voice from articulatory movement data without the subject making any sound [103]. In addition to the sensors (EMA, PMA, sEMG) introduced in Section 2, articulator movement can be captured using other modalities, such as ultrasound tongue imaging (UTI) and non-audible murmur (NAM); the ability to collect voice information, especially from people who cannot make sounds, helps compensate for insufficient training data [77]…”
Section: Deep Learning Based Voice Recognition
confidence: 99%
“…Prosody is mainly conditioned by the airflow and the vibration of the vocal folds, which laryngectomised patients cannot recover. As a result, most direct synthesis techniques that generate a voice from sensed articulatory movements can, at best, recover a monotonous voice with limited pitch variation [101], [281], [282]. The use of complementary information capable of restoring prosodic features is thus an important area for future research…”
Section: A. Improved Sensing Techniques
confidence: 99%
“…The early studies on articulatory-to-acoustic mapping typically applied a low-order spectral representation; for example, only 12 coefficients were used in [8, 10]. Later, our team also experimented with 22 kHz speech and a 24-order MGC-LSP target [15] (25 dimensions altogether, including the gain). Still, the 24-order MGC-LSP target is a relatively low-dimensional spectral representation, and this simple vocoder that we used in previous studies [10, 11, 12, 13, 14, 15] can be a bottleneck in the ultrasound-to-speech mapping framework…”
Section: Introduction
confidence: 99%
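
To make the dimensions in the last statement concrete, below is a hypothetical PyTorch sketch of an articulatory-to-acoustic network that maps one flattened UTI frame to a 25-dimensional target (24-order MGC-LSP plus gain). The input resolution, layer sizes, and class name are illustrative assumptions, not the architecture of the cited papers.

```python
# Hypothetical sketch of a DNN for articulatory-to-acoustic mapping:
# one flattened 64x128 ultrasound tongue image frame is regressed to
# 25 spectral parameters (24-order MGC-LSP + gain). Sizes are assumed.
import torch
import torch.nn as nn

class UTI2MGCLSP(nn.Module):
    def __init__(self, in_pixels=64 * 128, out_dim=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_pixels, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, out_dim),  # regression head: no activation
        )

    def forward(self, x):  # x: (batch, in_pixels)
        return self.net(x)

model = UTI2MGCLSP()
frame = torch.rand(1, 64 * 128)  # one flattened UTI frame (dummy data)
mgc_lsp = model(frame)           # -> (1, 25) spectral parameters
```

Training such a network would typically minimize mean squared error against MGC-LSP parameters extracted from the parallel speech by the vocoder's analysis step.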