Autism spectrum disorder (ASD) is a developmental disorder with a life-span disability. While diagnostic instruments have been developed and qualified based on the accuracy of the discrimination of children with ASD from typical development (TD) children, the stability of such procedures can be disrupted by limitations pertaining to time expenses and the subjectivity of clinicians. Consequently, automated diagnostic methods have been developed for acquiring objective measures of autism, and in various fields of research, vocal characteristics have not only been reported as distinctive characteristics by clinicians, but have also shown promising performance in several studies utilizing deep learning models based on the automated discrimination of children with ASD from children with TD. However, difficulties still exist in terms of the characteristics of the data, the complexity of the analysis, and the lack of arranged data caused by the low accessibility for diagnosis and the need to secure anonymity. In order to address these issues, we introduce a pre-trained feature extraction auto-encoder model and a joint optimization scheme, which can achieve robustness for widely distributed and unrefined data using a deep-learning-based method for the detection of autism that utilizes various models. By adopting this auto-encoder-based feature extraction and joint optimization in the extended version of the Geneva minimalistic acoustic parameter set (eGeMAPS) speech feature data set, we acquire improved performance in the detection of ASD in infants compared to the raw data set.
In this paper, we propose an end-to-end (E2E) neural network model to detect autism spectrum disorder (ASD) from children’s voices without explicitly extracting the deterministic features. In order to obtain the decisions for discriminating between the voices of children with ASD and those with typical development (TD), we combined two different feature-extraction models and a bidirectional long short-term memory (BLSTM)-based classifier to obtain the ASD/TD classification in the form of probability. We realized one of the feature extractors as the bottleneck feature from an autoencoder using the extended version of the Geneva minimalistic acoustic parameter set (eGeMAPS) input. The other feature extractor is the context vector from a pretrained wav2vec2.0-based model directly applied to the waveform input. In addition, we optimized the E2E models in two different ways: (1) fine-tuning and (2) joint optimization. To evaluate the performance of the proposed E2E models, we prepared two datasets from video recordings of ASD diagnoses collected between 2016 and 2018 at Seoul National University Bundang Hospital (SNUBH), and between 2019 and 2021 at a Living Lab. According to the experimental results, the proposed wav2vec2.0-based E2E model with joint optimization achieved significant improvements in the accuracy and unweighted average recall, from 64.74% to 71.66% and from 65.04% to 70.81%, respectively, compared with a conventional model using autoencoder-based BLSTM and the deterministic features of the eGeMAPS.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.