A Speech Command Control-Based Recognition System for Dysarthric Patients Based on Deep Learning Technology

Lin, Yu Yi; Zheng, Wei; Chu, Wei; Han, Ji-Yan; Hung, Ying Hsiu; Ho, Guan Min; Chang, Chia Yuan; Lai, Yeong‐Lin

doi:10.3390/app11062477

Cited by 20 publications

(13 citation statements)

References 66 publications

“…For our first experiment, we chose the Mandarin commands recognition benchmark [30] collected from Dysarthric patients as private data. This benchmark dataset includes ten high-frequent action commands: close, up, down, previous, next, in, out, left, right, and home; and nine spoken digits: one, two, three, four, five, six, seven, eight, and nine with 16 kHz sampling rate in a total of 600 utterances.…”

Section: Spoken Command Recognition and Results

Citation type: mentioning

confidence: 99%

“…This benchmark dataset includes ten high-frequent action commands: close, up, down, previous, next, in, out, left, right, and home; and nine spoken digits: one, two, three, four, five, six, seven, eight, and nine with 16 kHz sampling rate in a total of 600 utterances. Adopting the experimental setting described in [30], we split the audio data into 70% and 30% for training and testing sets under a 7-fold cross-validation scheme. To set up public data for training the student model, we use the public Common Voice dataset [31] and collect the same Mandarin command actions and 600 utterances from the Dysarthric dataset.…”

Section: Spoken Command Recognition and Results

Citation type: mentioning

confidence: 99%
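
The quoted statement does not say how the 70%/30% split and the seven folds fit together. One possible reading, offered only as an assumption, is seven independent random 70/30 splits of the 600 utterances (repeated random sub-sampling, not classical k-fold). The function name `repeated_splits` and the seed are hypothetical.

```python
import random

def repeated_splits(n_items, train_frac=0.7, n_rounds=7, seed=0):
    """Yield (train_idx, test_idx) for n_rounds independent random splits.

    Assumption: each round reshuffles all items and cuts once at train_frac.
    This is repeated random sub-sampling, not classical k-fold.
    """
    rng = random.Random(seed)
    n_train = round(n_items * train_frac)
    for _ in range(n_rounds):
        order = list(range(n_items))
        rng.shuffle(order)
        yield sorted(order[:n_train]), sorted(order[n_train:])

# 600 utterances, as stated in the citation statement.
splits = list(repeated_splits(600))
```

Each round gives 420 training and 180 testing indices with no overlap. A real setup would likely also stratify by command so that every class appears in both sets.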

An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

Yang, Qi, Siniscalchi, et al. (2022). Preprint.

We propose an ensemble learning framework with Poisson sub-sampling that trains a collection of teacher models to provide a differential privacy (DP) guarantee for the training data. Through boosting under DP, a student model derived from the training data suffers little degradation relative to models trained with no privacy protection. Our proposed solution leverages two mechanisms: (i) privacy budget amplification via Poisson sub-sampling, so that the target prediction model requires less noise to achieve the same privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework that introduces DP-preserving noise at the output of the teacher models and transfers DP-preserving properties via noisy labels. Privacy-preserving student models are then trained with the noisy labels to learn DP-protected knowledge from the teacher model ensemble. Experimental evidence on spoken command recognition and continuous speech recognition of Mandarin speech shows that our proposed framework greatly outperforms existing benchmark DP-preserving algorithms in both speech processing tasks.
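
The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a PATE-style aggregation in which Laplace noise is added to per-class teacher vote counts, and all function names (`poisson_subsample`, `laplace_noise`, `noisy_vote`) and the noise scale are hypothetical.

```python
import random
from collections import Counter

def poisson_subsample(items, q, rng):
    """Poisson sub-sampling: keep each item independently with probability q."""
    return [x for x in items if rng.random() < q]

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_vote(teacher_labels, num_classes, scale, rng):
    """Return the class with the largest noisy vote count.

    Assumption: noise is added to the vote histogram (PATE-style);
    the abstract only says noise is introduced at the teacher outputs.
    """
    counts = Counter(teacher_labels)
    noisy = [counts.get(c, 0) + laplace_noise(scale, rng)
             for c in range(num_classes)]
    return max(range(num_classes), key=noisy.__getitem__)

rng = random.Random(0)
subset = poisson_subsample(list(range(1000)), q=0.3, rng=rng)
label = noisy_vote([2] * 50, num_classes=19, scale=1.0, rng=rng)
```

In such a scheme, each teacher would be trained on its own Poisson sub-sample of the private data, and the student would see only the noisy labels returned by `noisy_vote` on public data.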

“…A CNN for audio digit classification with Mel spectrogram achieved 97.53%. A phonetic posteriorgram (PPG) speech feature with CNN was applied in speech command control-based recognition [42]. The dataset was created by 3 cerebral palsy (CP) patients who spoke 19 Mandarin commands 10 times each.…”

Section: Speech Recognition Tasks

Citation type: mentioning

confidence: 99%

An Acoustic Feature-Based Deep Learning Model for Automatic Thai Vowel Pronunciation Recognition

Rukwong, Pongpinigpinyo (2022). Applied Sciences.

For Thai vowel pronunciation, it is very important to know that when mispronunciation occurs, the meanings of words change completely. Thus, effective and standardized practice is essential to pronouncing words correctly, as a native speaker would. Since the COVID-19 pandemic, online learning has become increasingly popular. For example, an online pronunciation application system was introduced that has virtual teachers and an intelligent process of evaluating students that is similar to standardized training by a teacher in a real classroom. This research presents an online automatic computer-assisted pronunciation training (CAPT) system that uses deep learning to recognize Thai vowels in speech. The automatic CAPT is developed to address the shortage of instruction specialists and the complex vowel teaching process. It is a unique system that integrates computer techniques with linguistic theory. The deep learning model that recognizes pronounced vowels is the most significant part of the automatic CAPT. The major challenge in Thai vowel recognition is the correct identification of Thai vowels when spoken in real-world situations. A convolutional neural network (CNN), a deep learning model, is applied and developed for the classification of pronounced Thai vowels. A new dataset for Thai vowels was designed, collected, and examined by linguists. The optimal CNN model with Mel spectrogram (MS) achieves the highest accuracy, 98.61%, whereas Mel frequency cepstral coefficients (MFCC) with the baseline long short-term memory (LSTM) model and MS with the baseline LSTM model reach accuracies of 94.44% and 90.00%, respectively.
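
Both the Mel spectrogram and MFCC features mentioned in this abstract rest on the mel frequency scale. The sketch below shows one widely used conversion, mel = 2595 · log10(1 + f / 700). The abstract does not state which variant the authors used, so the formula choice and the helper names are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_bands, f_min, f_max):
    """Center frequencies (Hz) of n_bands filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

# Hypothetical setting: 10 bands up to the Nyquist frequency of 16 kHz audio.
centers = mel_center_frequencies(10, 0.0, 8000.0)
```

Because the bands are evenly spaced in mels, their centers are packed closely at low frequencies and spread out at high frequencies, which roughly mirrors human pitch perception.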

“…According to previous research, CNNs possess strong adaptability and gradually have become the main research tool in the field of image and speech [43,44]. In the study of speaker recognition, the spectrogram [45] gives a large amount of information including the personality characteristics of the speaker, and dynamically shows the characteristics of the signal spectrum change.…”

Section: Convolutional Neural Network (CNN)

Citation type: mentioning

confidence: 99%

A Deep Neural Network Model for Speaker Identification

Yang (2021). Applied Sciences.

Speaker identification is a classification task that aims to identify a subject from given time-series data. Since the speech signal is a continuous one-dimensional time series, most current research methods are based on a convolutional neural network (CNN) or a recurrent neural network (RNN). These methods perform well in many tasks, but there has been no attempt to combine the two network models for the speaker identification task. Because the spectrogram of a speech signal contains the spatial features of the voiceprint (which corresponds to the voice spectrum), a CNN is effective for spatial feature extraction (which corresponds to modeling spectral correlations in acoustic features). At the same time, the speech signal is a time series, and a deep RNN can represent long utterances better than shallow networks. Considering the advantage of the gated recurrent unit (GRU) over the traditional RNN in the segmentation of sequence data, we decided to use stacked GRU layers in our model for frame-level feature extraction. In this paper, we propose a deep neural network (DNN) model based on a two-dimensional convolutional neural network (2-D CNN) and a gated recurrent unit (GRU) for speaker identification. In the network model design, the convolutional layer is used for voiceprint feature extraction and reduces dimensionality in both the time and frequency domains, allowing for faster GRU layer computation. In addition, the stacked GRU recurrent network layers can learn a speaker's acoustic features. During this research, we tried various neural network structures, including 2-D CNN, deep RNN, and deep LSTM. These network models were evaluated on the Aishell-1 speech dataset. The experimental results showed that our proposed DNN model, which we call deep GRU, achieved a high recognition accuracy of 98.96%. The results also demonstrate the effectiveness of the proposed deep GRU network model versus other models for speaker identification. Through further optimization, this method could be applied to other research similar to the study of speaker identification.
