2020
DOI: 10.48550/arxiv.2011.09804
Preprint
TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech. This paper describes the …

Cited by 3 publications
(15 citation statements)
References 35 publications
(45 reference statements)
“…Our best MCD score of 3.08 corresponds to a low-quality but intelligible speech [4]. In comparison, Ribeiro et al obtained an MCD score of 2.99 on the same corpus using more sophisticated encoder-decoder networks [20].…”
Section: The Impact of VAD on the SSI
confidence: 67%
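The MCD scores quoted in this statement (3.08 vs. 2.99) refer to Mel-Cepstral Distortion, the standard dB-scale objective measure for comparing synthesized and reference speech. A minimal sketch of the usual formula follows; the coefficient count and the convention of dropping the 0th (energy) coefficient are common practice, not details taken from the cited papers:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_est):
    """Frame-averaged Mel-Cepstral Distortion (MCD) in dB.

    mcep_ref, mcep_est: (n_frames, n_coeffs) arrays of mel-cepstral
    coefficients, assumed already time-aligned. The 0th coefficient
    (frame energy) is conventionally excluded from the distance.
    """
    diff = mcep_ref[:, 1:] - mcep_est[:, 1:]
    # Per-frame distance: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical sequences give zero distortion; any mismatch gives a positive dB value.
ref = np.random.default_rng(0).normal(size=(5, 13))
print(mel_cepstral_distortion(ref, ref))  # → 0.0
```

Lower is better, which is why the 2.99 obtained with encoder-decoder networks is reported as an improvement over 3.08.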
“…For the experiments we used the English TAL corpus [20]. It contains parallel ultrasound, speech and lip video recordings from 81 native English speakers, and we used just the TaL1 subset which contains recordings from one male native speaker.…”
Section: The Ultrasound Dataset
confidence: 99%
“…As Fig 1 shows, the input to our system is a sequence of ultrasound tongue imaging (UTI) frames, and the target sequence is a speech signal. This is a sequence-to-sequence mapping problem, which could be addressed by sophisticated encoder-decoder networks that would not even require aligned training data [25]. However, as we have synchronized input-output samples, most authors apply simpler networks that perform the mapping frame by frame [28,5].…”
Section: The SSI Framework
confidence: 99%
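The frame-by-frame mapping this statement describes — exploiting the synchronized input-output samples so that each ultrasound frame is regressed onto one speech-feature frame, with no sequence model — can be sketched as follows. All dimensions and data here are placeholders, and a linear least-squares map stands in for the DNN regressors used in the cited work:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions: each ultrasound frame flattened to 64x128 pixels,
# each target frame a 25-dim vector of spectral/vocoder parameters.
n_frames, uti_dim, spec_dim = 200, 64 * 128, 25

uti = rng.normal(size=(n_frames, uti_dim))    # synchronized UTI frames (input)
spec = rng.normal(size=(n_frames, spec_dim))  # parallel speech features (target)

# Because the two streams are already aligned frame-for-frame, the mapping
# can be learned per frame: one input frame -> one output frame.
W, *_ = np.linalg.lstsq(uti, spec, rcond=None)
pred = uti @ W  # predicted speech features, one row per ultrasound frame
```

An aligned-data setup like this is why the statement contrasts simpler frame-wise networks with encoder-decoder models, which are only needed when the training data is unaligned.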
“…In the experiments we used the TaL80 corpus [25], which contains ultrasound, speech and lip video recordings from 81 speakers. Apart from the silent speech experiments, the speech signals were also recorded in parallel with the ultrasound, and here we used these synchronized ultrasound and speech tracks.…”
Section: Experimental Set-up
confidence: 99%