Voice Pathology Detection Using Modulation Spectrum-Optimized Metrics

Biomedical Signal Processing and Control

Moro-Velázquez

Godino-Llorente

2019

This is the second of a two-part series devoted to the automatic voice condition analysis of voice pathologies, and a direct continuation of the paper "On the design of automatic voice condition analysis systems. Part I: review of concepts and an insight to the state of the art". The aim of this study is to examine several variability factors affecting the robustness of systems that automatically detect the presence of voice pathologies by means of audio registers. Multiple experiments are performed to test the influence of the speech task, extralinguistic aspects (such as sex), the acoustic features and the classifiers on their performance. Some experiments are carried out using state-of-the-art classification methodologies often employed in speaker recognition. In order to evaluate the robustness of the methods, testing is repeated across several corpora with the aim of creating a single system integrating the conclusions obtained previously. This system is later tested under cross-dataset scenarios in an attempt to obtain more realistic conclusions. Results identify a reduced subset of relevant features, which are used in a hierarchical-like scenario incorporating information from different speech tasks. In particular, for the experiments carried out using the Saarbrücken voice dataset, the area under the ROC curve of the system reached 0.88 in an intra-dataset setting and ranged from 0.82 to 0.94 in cross-dataset scenarios. These results open a discussion about the suitability of these techniques to be transferred to the clinical setting.

Section: Discussion (mentioning)

confidence: 99%

Section: Ancillary Datasets (mentioning)

confidence: 99%

On the design of automatic voice condition analysis systems. Part II: Review of speaker recognition techniques and study on the effects of different variability factors

Biomedical Signal Processing and Control

Moro-Velázquez

Godino-Llorente

2019

“…Likewise, windows of 55 ms length are used with the complexity features as suggested in [8]. Finally, for the experiments in the modulation spectrum set, frames of 180 ms are utilized as suggested in [6], [7].…”

Section: B Methodology (mentioning)

confidence: 99%

ByoVoz Automatic Voice Condition Analysis System for the 2018 FEMH Challenge

Arias-Londoño

2018 IEEE International Conference on Big Data (Big Data)

Moro-Velázquez

et al. 2018

Self Cite

This paper presents the methods and results used by the ByoVoz team for the design of an automatic voice condition analysis system, which was submitted to the 2018 Far East Memorial Hospital voice data challenge. The proposed methodology is based on a cascading scheme that first discriminates between pathological and normophonic voices, and then identifies the type of disorder. By using diverse feature selection techniques, a subset of complexity, spectral/cepstral and perturbation characteristics was identified for the proposed tasks. Then, several generative classification methodologies based on Gaussian Mixture Models and Gradient Boosting were employed to provide decisions about the input voices in the binary classification, and one-vs-one classification systems based on Random Forests were used for the categorization according to the type of disorder. By using a 4-fold cross-validation approach on the training partition, a sensitivity of 0.93 and a specificity of 0.74 were obtained. Similarly, an unweighted average recall of 0.63 and an accuracy of 66% were obtained for the identification task. Using the scoring metric proposed in the challenge, the final score considering both detection and identification is 0.77.

“…For the purposes of this paper, a representation learning approach based on MS is employed to characterize modulation and acoustic frequencies of input voices [39], following a short-time basis using frames of 180 ms as proposed in [5], [40]. The MS have been successfully used in different works related to the characterization of pathological voices, but because of the large amount of data they contain, it is always necessary to extract some hand-tuned statistics [5], [40] or to use feature selection techniques [41]. In the representation learning approach considered in this paper, Convolutional Neural Networks (CNNs) are used to automatically extract information from MS in the context of voice quality assessment.…”

Section: A Characterization (mentioning)

confidence: 99%

Multimodal and Multi-Output Deep Learning Architectures for the Automatic Assessment of Voice Quality Using the GRB Scale

Arias-Londoño

IEEE J. Sel. Top. Signal Process.

Godino-Llorente

2020

This paper addresses the automatic assessment of voice quality according to the GRB scale, based on the use of various deep learning architectures for prediction purposes. The proposed architectures are multimodal, because they employ multiple sources of information, and also multi-output, because they simultaneously predict all the traits of the GRB scale. A feature engineering approach is followed, based on the use of deep neural networks and a set of well-established features such as MFCC, perturbation and complexity characteristics. Likewise, a representation learning approach is considered, using convolutional neural networks fed on modulation spectra extracted from voices. Finally, a variety of loss functions are also investigated, including two surrogate ordinal classification functions, a conventional weighted categorical cross-entropy, and a mean square error function. Experiments are carried out on a dataset containing registers of the sustained phonation of three vowels. The best deep learning architecture provides a relative performance improvement of 6.25% for G, 14.1% for R and 18.1% for B, in comparison with recently published results using the same dataset.