Interspeech 2021
DOI: 10.21437/interspeech.2021-23

Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video

Abstract: We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address th…

Cited by 9 publications (11 citation statements)
References 23 publications (54 reference statements)
“…The theoretical background is provided by articulatory-to-acoustic mapping (AAM), where articulatory data is recorded while the subject is speaking, and machine learning methods (typically deep neural networks (DNNs)) are applied to predict the speech signal from the articulatory input. The set of articulatory acquisition devices includes ultrasound tongue imaging (UTI) [4,5,6,7,8], Magnetic Resonance Imaging (MRI) [9], electromagnetic articulography (EMA) [10,11,12], permanent magnetic articulography (PMA) [13,14,15], surface electromyography (sEMG) [16,17,18], electro-optical stomatography (EOS) [19], lip videos [20,21], or a multimodal combination of the above [22].…”
Section: Introduction (citation type: mentioning, confidence: 99%)
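The statement above outlines the AAM pipeline: record articulatory data (here, ultrasound tongue images or lip video) while the subject speaks, then train a DNN to predict acoustic features from it. The sketch below illustrates that frame-level regression; the network shape, the 64x128 frame size, the 80-dim mel target, and the synthetic batch are assumptions chosen for illustration, not details of any cited system.

```python
# Minimal sketch of articulatory-to-acoustic mapping (AAM), assuming
# 64x128 ultrasound tongue frames as input and 80-dim mel-spectrogram
# frames as regression targets. Shapes, layer sizes, and the synthetic
# batch are illustrative, not taken from the cited systems.
import torch
import torch.nn as nn

class AAMNet(nn.Module):
    def __init__(self, frame_shape=(64, 128), n_mels=80):
        super().__init__()
        in_dim = frame_shape[0] * frame_shape[1]
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, n_mels),   # per-frame spectral estimate
        )

    def forward(self, x):
        return self.net(x)

if __name__ == "__main__":
    model = AAMNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    # Placeholder batch: 32 ultrasound frames and paired mel targets.
    ultrasound = torch.randn(32, 64, 128)
    mel_target = torch.randn(32, 80)

    pred = model(ultrasound)
    loss = loss_fn(pred, mel_target)
    loss.backward()
    optimizer.step()
    print(f"training loss: {loss.item():.3f}")
```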
“…mean ultrasound image) helps the model generalize to unseen speakers. The same authors reported that unsupervised model adaptation can improve the results for silent speech (but not for modal speech) [6]. They also performed multi-speaker recognition and synthesis experiments where they applied x-vectors for speaker conditioning, but they extracted the x-vectors from the acoustic data and not from the ultrasound [25].…”
Section: Introduction (citation type: mentioning, confidence: 99%)
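Two speaker-handling ideas are mentioned in the statement above: subtracting a per-speaker mean ultrasound image so the model generalizes to unseen speakers, and conditioning the network on an x-vector speaker embedding. The sketch below illustrates both in the abstract; the feature and embedding dimensions, module names, and synthetic tensors are assumptions for illustration and do not reproduce the cited authors' implementations.

```python
# Sketch of the two speaker-handling ideas mentioned above, with assumed
# shapes: (1) subtracting a per-speaker mean ultrasound image, and
# (2) conditioning a network by concatenating an x-vector (speaker
# embedding) to the frame features. Both are illustrative only.
import torch
import torch.nn as nn

def normalize_by_speaker_mean(frames: torch.Tensor) -> torch.Tensor:
    """Subtract the speaker's mean ultrasound image from every frame."""
    mean_image = frames.mean(dim=0, keepdim=True)
    return frames - mean_image

class ConditionedDecoder(nn.Module):
    """Frame features + x-vector -> acoustic features (assumed dims)."""
    def __init__(self, feat_dim=512, xvector_dim=192, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + xvector_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, frame_feats, xvector):
        # Broadcast one x-vector per utterance across its frames.
        xvec = xvector.unsqueeze(0).expand(frame_feats.size(0), -1)
        return self.net(torch.cat([frame_feats, xvec], dim=-1))

if __name__ == "__main__":
    frames = torch.randn(200, 64, 128)      # one speaker's frames (assumed)
    frames = normalize_by_speaker_mean(frames)

    frame_feats = torch.randn(200, 512)     # encoder output (assumed)
    xvector = torch.randn(192)              # speaker embedding (assumed)
    mel = ConditionedDecoder()(frame_feats, xvector)
    print(mel.shape)                        # torch.Size([200, 80])
```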
“…The same authors reported that unsupervised model adaptation can improve the results for silent speech (but not for modal speech) [83]. They also performed multi-speaker recognition and synthesis experiments where they applied x-vectors for speaker conditioning, but they extracted the x-vectors from the acoustic data and not from the ultrasound [82].…”
Section: Problem Description and Literature Overview (citation type: mentioning, confidence: 99%)
“…To handle the session dependency of UTI-based synthesis, Gosztolya et al. used data from different sessions [28]. Ribeiro et al. reported that, for a speaker-independent system, unsupervised model adaptation can improve the results for silent speech [83]. In a multi-speaker framework, in Chapter 4 we experimented with the use of x-vector features extracted from the speakers, leading to a marginal improvement in the spectral estimation step [37].…”
Section: Chapter (citation type: mentioning, confidence: 99%)
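Several of the statements above cite unsupervised model adaptation for silent speech. One generic way to realize this is self-training: decode the unlabelled silent-speech data with the pretrained model, then fine-tune on the resulting pseudo-labels. The sketch below shows that recipe with a toy frame classifier and assumed dimensions; it is not claimed to be the exact adaptation scheme of [83].

```python
# Hedged sketch of unsupervised (self-training) adaptation: decode the
# unlabelled silent-speech data with the pretrained model, treat the
# hypotheses as pseudo-labels, and fine-tune on them. This is a generic
# recipe, not necessarily the adaptation scheme of the cited work.
import torch
import torch.nn as nn

def adapt_unsupervised(model: nn.Module, silent_feats: torch.Tensor,
                       steps: int = 10, lr: float = 1e-5) -> nn.Module:
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # First pass: pseudo-labels from the unadapted model.
    model.eval()
    with torch.no_grad():
        pseudo_labels = model(silent_feats).argmax(dim=-1)

    # Fine-tune on the model's own hypotheses.
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(silent_feats), pseudo_labels)
        loss.backward()
        optimizer.step()
    return model

if __name__ == "__main__":
    # Toy frame classifier over 40 phone classes (dimensions assumed).
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 40))
    silent_feats = torch.randn(300, 512)   # unlabelled silent-speech frames
    adapt_unsupervised(model, silent_feats)
```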