Interspeech 2018
DOI: 10.21437/interspeech.2018-1351

A Deep Learning Method for Pathological Voice Detection Using Convolutional Deep Belief Networks

Abstract: Automatically detecting pathological voice disorders such as vocal cord paralysis or Reinke's edema is a challenging and important medical classification problem. While deep learning techniques have achieved significant progress in the field of speech recognition, there has been less research in the area of pathological voice disorder detection. A novel system for pathological voice detection using a convolutional neural network (CNN) as the basic architecture is presented in this work. The novel system uses s…

Cited by 72 publications (43 citation statements) · References 8 publications
“…For every database, 70% of the data is used in training, 20% is used in testing and the remaining 10% of the speech data is used for validation. This type of data partition has been followed in several previous detection studies related both to traditional pipeline [76] and end-to-end [32], [33] systems. For UA-Speech and TORGO, the database is split in order to maintain a good partition of speakers with different severities or intelligibility scores between the training, validation, and test sets, without having any overlap in speakers between the different sets.…”
Section: B. Experimental Setup
confidence: 99%
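The 70/20/10 speaker-independent partition described above can be sketched in plain Python. This is a minimal illustration, not the cited authors' code; the function name and the speaker-ID-to-utterances mapping are assumptions:

```python
import random

def speaker_independent_split(samples, train=0.7, test=0.2, seed=0):
    """Partition utterances into train/test/validation sets (70/20/10 by
    speaker) with no speaker overlap between sets.
    `samples` maps a speaker ID to that speaker's list of utterances.
    Hypothetical helper for illustration only."""
    speakers = sorted(samples)
    random.Random(seed).shuffle(speakers)      # deterministic shuffle
    n = len(speakers)
    n_train = round(n * train)
    n_test = round(n * test)
    split_ids = {
        "train": speakers[:n_train],
        "test": speakers[n_train:n_train + n_test],
        "val": speakers[n_train + n_test:],    # remainder (~10%)
    }
    # Flatten speaker IDs back into utterance lists per partition.
    return {name: [u for spk in ids for u in samples[spk]]
            for name, ids in split_ids.items()}
```

Splitting by speaker rather than by utterance is what prevents the same voice from appearing in both training and test data, which would otherwise inflate detection scores.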
“…In studying pathological voice detection with end-to-end systems, previous studies have used either the raw time-domain speech signal or its spectrum to train deep learning models [31]-[35]. To develop deep learning models, existing studies have mainly used combinations of convolutional neural networks (CNN) and multilayer perceptrons (MLP) [31], [33]-[37]. In addition, some studies have explored combining a CNN with long short-term memory (LSTM) networks [32], and combining LSTM and MLP [38], for the detection of pathological voice from healthy speech.…”
Section: Introduction
confidence: 99%
“…Pathological voice disorder, due to vocal cord paralysis or Reinke's edema, is investigated in [112]. In the paper, Figure 16…”
Section: E. The Spectrogram Features
confidence: 99%
“…Noise reduction: background noise is reduced based on the spectral gating algorithm implemented in the SoX codec.³ The core idea of the algorithm is to attenuate those segments of the signal whose spectral energy falls below certain thresholds, which are obtained by computing the mean power in each frequency band from the STFT of a noise profile extracted from a silent region of the speech signal.…”
Section: Preprocessing
confidence: 99%
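The spectral-gating idea quoted above can be sketched in a few lines of NumPy: per-frequency thresholds come from the mean spectral magnitude of a noise profile, and STFT bins below the threshold are attenuated. This is an illustrative toy, not the SoX implementation; the parameter names, the attenuation factor, and the threshold margin are assumptions:

```python
import numpy as np

def spectral_gate(signal, noise, n_fft=512, hop=128, atten=0.1, margin=1.5):
    """Toy spectral gating: attenuate STFT bins whose magnitude falls below
    a per-frequency threshold derived from a noise profile."""
    win = np.hanning(n_fft)

    def stft(x):
        frames = [x[i:i + n_fft] * win
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.array(frames), axis=1)

    S = stft(signal)                                  # (frames, freq bins)
    noise_mag = np.abs(stft(noise)).mean(axis=0)      # per-frequency mean magnitude
    keep = np.abs(S) >= margin * noise_mag            # bins above threshold
    S_gated = np.where(keep, S, atten * S)            # attenuate the rest
    # Weighted overlap-add inverse STFT.
    out = np.zeros(len(signal))
    wsum = np.zeros(len(signal))
    for k, frame in enumerate(np.fft.irfft(S_gated, n=n_fft, axis=1)):
        out[k * hop:k * hop + n_fft] += frame * win
        wsum[k * hop:k * hop + n_fft] += win ** 2
    return out / np.maximum(wsum, 1e-8)
```

The noise profile would typically be a silent stretch of the same recording, matching the extraction step described in the quoted passage.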
“…After the convolution operation, the resulting feature maps contain low- and high-level features representing the acoustic information of the signals. Many works have shown the advantages of using CNNs and spectrograms in different speech processing applications, such as automatic detection of disordered speech [2]-[4], acoustic models for automatic speech recognition systems [5], [6], and emotion detection [7], among others. These studies, however, consider single-channel spectrograms to obtain the feature maps, e.g., the short-time Fourier transform (STFT) is applied to the audio signal and the resulting spectrogram is used as input to the model.…”
Section: Introduction
confidence: 99%
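As a toy illustration of the convolution step these studies describe, a single-channel "valid" 2-D convolution over a spectrogram producing one feature map might look as follows. This is a hypothetical sketch; real systems use a deep-learning framework and many kernels per layer:

```python
import numpy as np

def conv2d_valid(spec, kernel):
    """Slide one kernel over a (freq, time) spectrogram with 'valid'
    padding, then apply ReLU, yielding a single feature map.
    Illustrative only; frameworks vectorize and batch this."""
    H, W = spec.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the kernel with one spectrogram patch.
            out[i, j] = np.sum(spec[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```

A CNN layer simply repeats this with many learned kernels, and the stack of resulting feature maps carries the low- and high-level acoustic features the quoted passage refers to.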