Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video

Ribeiro, Manuel Sam; Eshky, Aciel; Richmond, Korin; Renals, Steve

doi:10.21437/interspeech.2021-23

Cited by 9 publications

(11 citation statements)

References 23 publications

(54 reference statements)


“…The theoretical background is provided by articulatory-to-acoustic mapping (AAM), where articulatory data is recorded while the subject is speaking, and machine learning methods (typically deep neural networks (DNNs)) are applied to predict the speech signal from the articulatory input. The set of articulatory acquisition devices includes ultrasound tongue imaging (UTI) [4,5,6,7,8], Magnetic Resonance Imaging (MRI) [9], electromagnetic articulography (EMA) [10,11,12], permanent magnetic articulography (PMA) [13,14,15], surface electromyography (sEMG) [16,17,18], electro-optical stomatography (EOS) [19], lip videos [20,21], or a multimodal combination of the above [22].…”

Section: Introduction (mentioning)

confidence: 99%

“…mean ultrasound image) helps the model generalize to unseen speakers. The same authors reported that unsupervised model adaptation can improve the results for silent speech (but not for modal speech) [6]. They also performed multi-speaker recognition and synthesis experiments where they applied x-vectors for speaker conditioning - but they extracted the x-vectors from the acoustic data and not from the ultrasound [25].…”

Section: Introduction (mentioning)

confidence: 99%


Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces

et al. 2021


Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.


“…The same authors reported that unsupervised model adaptation can improve the results for silent speech (but not for modal speech) [83]. They also performed multi-speaker recognition and synthesis experiments where they applied x-vectors for speaker conditioning - but they extracted the x-vectors from the acoustic data and not from the ultrasound [82].…”

Section: Problem Description and Literature Overview (mentioning)

confidence: 99%

“…To handle the session dependency of UTI-based synthesis, Gosztolya et al. used data from different sessions [28]. Ribeiro et al. reported that, for a speaker-independent system, unsupervised model adaptation can improve the results for silent speech [83]. In a multi-speaker framework, in Chapter 4 we experimented with the use of x-vectors features extracted from the speakers, leading to a marginal improvement in the spectral estimation step [37].…”

Section: Chapter (mentioning)

confidence: 99%


Improvements of Silent Speech Interface Algorithms

Honarmandi Shandiz


Gammatone filter features are another type of speech feature extraction method that is based on modeling the human auditory system. They are calculated by filtering the speech signal with a bank of gammatone filters, which are modeled after the tuning of the auditory system's hair cells. The output of each filter is then rectified and low-pass filtered, and the resulting signals are then used as features. This function is commonly used in ANNs as an activation function for hidden layers. It is able to produce speech with natural-sounding intonation and prosody.
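The gammatone pipeline described in this abstract (filter bank, rectification, low-pass filtering) can be illustrated with a minimal NumPy sketch. It is not taken from the cited thesis: the function names, the fourth-order impulse response, the ERB bandwidth formula, the 1.019 bandwidth factor, and the use of a moving average as the low-pass stage are all assumptions chosen for illustration.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (assumed Glasberg-Moore style formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, sr, duration=0.025, order=4):
    """Impulse response of one gammatone filter, normalised to unit energy."""
    t = np.arange(int(duration * sr)) / sr
    b = 1.019 * erb(fc_hz)  # assumed bandwidth scaling
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc_hz * t)
    return ir / np.sqrt(np.sum(ir ** 2))

def gammatone_features(signal, sr, centre_freqs, smooth_ms=10.0):
    """Filter, half-wave rectify, and smooth; returns (n_channels, n_samples)."""
    win = max(1, int(sr * smooth_ms / 1000.0))
    kernel = np.ones(win) / win  # moving average stands in for the low-pass filter
    feats = []
    for fc in centre_freqs:
        band = np.convolve(signal, gammatone_ir(fc, sr), mode="same")
        rectified = np.maximum(band, 0.0)  # half-wave rectification
        feats.append(np.convolve(rectified, kernel, mode="same"))
    return np.stack(feats)
```

For a pure 1 kHz tone, the channel centred at 1 kHz should carry most of the energy; a real front end would typically use many more channels spaced on an ERB scale and downsample the smoothed envelopes to a frame rate.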


Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training

Zheng, Ling

2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

