Diagnosing Dysarthria with Long Short-Term Memory Networks

Mayle, Alex; Mou, Zhiwei; Bunescu, Răzvan; Mirshekarian, Sadegh; Xu, Li; Liu, Chang

doi:10.21437/interspeech.2019-2903

Cited by 17 publications

(10 citation statements)

References 18 publications

(14 reference statements)


“…For the classifier, most of the previous investigations have used support vector machines (SVMs) [5], [13], [16], [22]. In addition to SVMs, other algorithms such as artificial neural networks, decision trees, and variants of recurrent neural network (RNN) have also been used as classifiers in the study area [13], [28], [32]-[34]. A review of various techniques considered for both parts is given in [5].…”

Section: Introduction (mentioning)

confidence: 99%

A Comparison of Cepstral Features in the Detection of Pathological Voices by Varying the Input and Filterbank of the Cepstrum Computation

Reddy,

Alku

2021

IEEE Access


Automatic voice pathology detection enables objective assessment of pathologies that affect the voice production mechanism. Detection systems have been developed using the traditional pipeline approach (consisting of the feature extraction part and the detection part) and using the modern deep learning-based end-to-end approach. Due to the lack of vast amounts of training data in the study area of pathological voice, the former approach is still a valid choice. In the existing detection systems based on the traditional pipeline approach, the mel-frequency cepstral coefficient (MFCC) features can be regarded as the de facto standard feature set. In this study, automatic voice pathology detection is investigated by comparing the performance of various MFCC variants derived by considering two factors: the input and the filterbank in the cepstrum computation. For the first factor, three inputs (the voice signal, the glottal source and the vocal tract) are compared. The glottal source and the vocal tract are estimated using the quasi-closed phase glottal inverse filtering method. For the second factor, the mel-frequency and linear-frequency filterbanks are compared. Experiments were conducted separately using six databases consisting of voices produced by speakers suffering from one of four disorders (dysphonia, Parkinson's disease, laryngitis, or heart failure) and by healthy speakers. Support vector machine (SVM) was used as the classifier. The results show that by combining mel- and linear-frequency cepstral coefficients derived from the glottal source and vocal tract, better overall detection accuracy was obtained compared to the de facto MFCC features derived from the voice signal. Furthermore, this combination provided comparable or better performance than four existing cepstral feature extraction techniques in clean and high signal-to-noise ratio (SNR) conditions.
INDEX TERMS Voice disorders, glottal inverse filtering, support vector machine, cepstral coefficients.


“…For the classifier, most of the previous investigations have used support vector machines (SVMs) [6], [16], [22], [24]-[26]. In addition to SVMs, other algorithms such as artificial neural networks [22], [27], decision trees [28], linear discriminant analysis (LDA) [23], [29], and variants of recurrent neural network (RNN) [30] have also been used as classifiers in the study area. Even though existing detection studies have trained data-driven models with many different types of features, there still exists a need for novel features which are effective and robust when used with different pathological voice databases.…”

Section: Introduction (mentioning)

confidence: 99%

Glottal Source Information for Pathological Voice Detection

Narendra

Alku

2020

IEEE Access


Automatic methods for the detection of pathological voice from healthy speech can be considered as potential clinical tools for medical treatment. This study investigates the effectiveness of glottal source information in the detection of pathological voice by comparing the classical pipeline approach to the end-to-end approach. The traditional pipeline approach consists of a feature extractor and a separate classifier. In the former, two sets of glottal features (computed using the quasi-closed phase glottal inverse filtering method) are used together with the widely used openSMILE features. Using both the glottal and openSMILE features extracted from voice utterances and the corresponding healthy/pathology labels, support vector machine (SVM) classifiers are trained. In building end-to-end systems, both raw speech signals and raw glottal flow waveforms are used to train two deep learning architectures: (1) a combination of convolutional neural network (CNN) and multilayer perceptron (MLP), and (2) a combination of CNN and long short-term memory (LSTM) network. Experiments were carried out using three publicly available databases, including dysarthric (the UA-Speech database and the TORGO database) and dysphonic voices (the UPM database). The performance analysis of the detection system based on the traditional pipeline approach showed the best results when the glottal features were combined with the baseline openSMILE features. The results of the end-to-end approach indicated higher accuracies (about 2-3 % improvement in all three databases) when glottal flow was used as the raw time-domain input (87.93 % for UA-Speech, 81.12 % for TORGO and 76.66 % for UPM) compared to using raw speech waveform (85.12 % for UA-Speech, 78.83 % for TORGO and 73.71 % for UPM). The evaluation of both approaches demonstrates that automatic detection of pathological voice from healthy speech benefits from using glottal source information.
INDEX TERMS Pathological voice, glottal source waveform, glottal features, support vector machines, end-to-end systems.


“…Detecting dysarthria involves extracting hand-crafted acoustic features and using those features as inputs to a machine learning-based classifier [18][19][20]. Deep learning approaches are also possible where the raw speech signal or a set of elementary features are fed into complex neural network architectures that automatically determine the important acoustic information and distinguish between healthy and dysarthric speech [21,22]. Deep learning approaches require less data preparation and feature engineering but may suffer from a lack of interpretability as further post-processing is often required to interpret how the speaker's speech is impaired.…”

Section: Introduction (mentioning)

confidence: 99%

“…Various types of acoustic features have been proposed for detecting dysarthric speech. Spectral features such as Mel Frequency Cepstral Coefficients (MFCCs) are used in References [22,23], and filter banks are utilized in long short-term memory classifiers [21] and convolutional neural networks [24]. Spectral measures of fricatives are shown to significantly differ between healthy and dysarthric speakers in Reference [25] and are used as input to a machine learning classifier in Reference [26].…”

Section: Introduction (mentioning)

confidence: 99%

Prosody-Based Measures for Automatic Severity Assessment of Dysarthric Speech

2020


Impairments in speech are among the first cues for many neurological disorders. The traditional method of diagnosing speech disorders such as dysarthria involves a perceptual evaluation from a trained speech therapist. However, this approach is known to be difficult to use for assessing speech impairments due to the subjective nature of the task. As prosodic impairments are one of the earliest cues of dysarthria, the current study presents an automatic method of assessing dysarthria in a range of severity levels using prosody-based measures. We extract prosodic measures related to pitch, speech rate, and rhythm from speakers with dysarthria and healthy controls in English and Korean datasets, despite the fact that these two languages differ in terms of prosodic characteristics. These prosody-based measures are then used as inputs to random forest, support vector machine and neural network classifiers to automatically assess different severity levels of dysarthria. Compared to baseline MFCC features, relative accuracy improvements of 18.13% and 11.22% are achieved for the English and Korean datasets, respectively, when including prosody-based features. Furthermore, most improvements are obtained with a better classification of mild dysarthric utterances: a recall improvement from 42.42% to 83.34% for English speakers with mild dysarthria and a recall improvement from 36.73% to 80.00% for Korean speakers with mild dysarthria.
