A Saliency-Based Attention LSTM Model for Cognitive Load Classification from Speech

Gallardo-Antolín, Ascensión; Montero, Juan Manuel

doi:10.21437/interspeech.2019-1603

Cited by 9 publications

(12 citation statements)

References 22 publications


“…In contrast, non-relevant frames should be diminished or even ignored, so the values of the corresponding weights should be small. This approach has been proposed with great success in other automatic learning problems that deal with temporal sequences [14,16,17,19–21,25,45], including our previous works on the estimation of the intelligibility level [8,11].…”

Section: Attention Pooling (mentioning)

confidence: 99%

“…More recently, deep learning (DL) methods have been proposed for SIC as they have been proven to be very effective in several audio and speech-related tasks, such as acoustic event detection [14], automatic speech recognition [15], speech emotion recognition [16][17][18], cognitive load classification from speech [19,20], or deception detection from speech [21]. Recent studies propose the use of dense networks fed by features derived from the decomposition of log-mel spectrograms in temporal and frequency basis vectors [22], the use of convolutional neural networks and different spectro-temporal representations as input [23], or long short-term memory (LSTM) networks with MFCC as feature vectors [24] for multilevel or binary speech intelligibility classification.…”

Section: Introduction (mentioning)

confidence: 99%

“…This approach is called saliency pooling. In [19], we presented preliminary experiments on cognitive load estimation from speech using a system based on a similar strategy. Here, we deepen into this approach and we apply it to the task of SIC, what, to the best of our knowledge, has not been studied before.…”

Section: Introduction (mentioning)

confidence: 99%


An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification

Gallardo-Antolín

Montero

2021

Symmetry


Speech intelligibility is a crucial element in oral communication that can be influenced by multiple elements, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking our previous works, a SIC system based on an attentional long short-term memory (LSTM) network, as a starting point, we deal with the problem of the inadequate learning of the attention weights due to training data scarcity. For overcoming this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling where the WP weights are not automatically learned during the training process of the network, but are obtained from an external source of information, the Kalinli’s auditory saliency model. In this way, it is intended to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-speech dataset that comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli’s saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level.
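The weighted pooling idea in this abstract can be illustrated with a minimal sketch. It assumes the LSTM produces one output vector per frame and that a sequence-level vector is a weighted sum of those frames. Mean pooling uses uniform weights, attention pooling derives weights from a learned vector (here the hypothetical parameter `u`), and saliency pooling takes them from external per-frame saliency scores. Normalizing the saliency scores by their sum is an assumption made for illustration; the abstract does not state how the authors normalize them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pooling(frames, weights):
    """Weighted sum over time of frame-level vectors (e.g. LSTM outputs)."""
    if len(frames) != len(weights):
        raise ValueError("one weight per frame is required")
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def mean_pooling(frames):
    """Every frame contributes equally."""
    n = len(frames)
    return weighted_pooling(frames, [1.0 / n] * n)


def attention_pooling(frames, u):
    """Weights come from a learned vector u (hypothetical name):
    score_t = u . h_t, weights = softmax(scores)."""
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in frames]
    return weighted_pooling(frames, softmax(scores))


def saliency_pooling(frames, saliency):
    """Weights come from external, non-negative per-frame saliency scores.
    Sum normalization is an illustrative assumption."""
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency scores must have a positive sum")
    return weighted_pooling(frames, [s / total for s in saliency])


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled_mean = mean_pooling(frames)
pooled_saliency = saliency_pooling(frames, [1.0, 1.0, 2.0])
pooled_attention = attention_pooling(frames, [0.0, 0.0])
```

With a zero attention vector every score is equal, so attention pooling reduces to mean pooling. The saliency variant has no trainable pooling parameters, which is the property the abstract relies on when training data are scarce.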



“…For this reason, in recent years, speech technologies are being proposed for the assessment, diagnosis and tracking of different health conditions that affect the subject’s voice [20]. In this area, commonly referred to as Computational Paralinguistic Analysis, current research encompasses the detection of pathological voices due, for example, to laryngeal disorders [21]; the diagnosis and monitoring of neurodegenerative conditions, such as Parkinson’s disease [22, 23], Mild Cognitive Impairment [24], Alzheimer’s disease [24, 25] or Amyotrophic Lateral Sclerosis [26]; the prediction of stress and cognitive load level [27, 28]; and the detection of psychological pathologies, such as autism [29] or depression [30], which is the topic of this paper.…”

Section: Related Work (mentioning)

confidence: 99%

“…Conventional systems for speech-based health tasks consists of data-driven approaches based on hand-crafted acoustic features, such as pitch, prosody, loudness, rate of speech, and energies, among others, and a machine-learning algorithm such as Logistic Regression, Support Vector Machines (SVM) or Gaussian Mixture models [22, 23, 24, 29]. Nevertheless, very recent works, such as, for example, [20, 21, 25, 26, 27, 28], deal with the use of deep-learning techniques for these tasks, since, presently, these kinds of methods have achieved unprecedented successes in the field of automatic learning applied to signal processing, and particularly in image, video, and audio problems.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks

Vázquez-Romero

Gallardo-Antolín

2020

Entropy


This paper proposes a speech-based method for automatic depression classification. The system is based on ensemble learning for Convolutional Neural Networks (CNNs) and is evaluated using the data and the experimental protocol provided in the Depression Classification Sub-Challenge (DCC) at the 2016 Audio–Visual Emotion Challenge (AVEC-2016). In the pre-processing phase, speech files are represented as a sequence of log-spectrograms and randomly sampled to balance positive and negative samples. For the classification task itself, first, a more suitable architecture for this task, based on One-Dimensional Convolutional Neural Networks, is built. Secondly, several of these CNN-based models are trained with different initializations and then the corresponding individual predictions are fused by using an Ensemble Averaging algorithm and combined per speaker to get an appropriate final decision. The proposed ensemble system achieves satisfactory results on the DCC at the AVEC-2016 in comparison with a reference system based on Support Vector Machines and hand-crafted features, with a CNN+LSTM-based system called DepAudionet, and with the case of a single CNN-based classifier.
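The fusion step described in this abstract can be sketched as follows. The sketch assumes each trained model outputs one depression probability per speech segment, averages those probabilities across models (ensemble averaging), and then combines the segment-level values per speaker. The abstract says only that predictions are "combined per speaker", so the per-speaker mean and the 0.5 threshold below are illustrative assumptions, and all function names are hypothetical.

```python
from collections import defaultdict


def ensemble_average(model_probs):
    """Average per-segment probabilities across models.

    model_probs: one list per model, each holding one probability per segment.
    """
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def per_speaker_decision(segment_probs, speaker_ids, threshold=0.5):
    """Combine segment-level probabilities into one decision per speaker.

    Mean-then-threshold is an assumption for illustration only.
    """
    by_speaker = defaultdict(list)
    for prob, speaker in zip(segment_probs, speaker_ids):
        by_speaker[speaker].append(prob)
    return {
        speaker: (sum(probs) / len(probs)) >= threshold
        for speaker, probs in by_speaker.items()
    }


# Two models, three segments; segments 2 and 3 belong to the same speaker.
model_probs = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.2]]
fused = ensemble_average(model_probs)
decisions = per_speaker_decision(fused, ["a", "b", "b"])
```

Averaging across models trained with different initializations reduces the variance of any single run, which is the motivation the abstract gives for using an ensemble rather than a single CNN.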


On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

Gallardo-Antolín

Montero

2021

Neurocomputing

