At the Border of Acoustics and Linguistics: Bag-of-Audio-Words for the Recognition of Emotions in Speech

Schmitt, Maximilian; Ringeval, Fabien; Schuller, Björn

doi:10.21437/interspeech.2016-1124

Cited by 116 publications

(101 citation statements)

References 18 publications


“…The bag-of-words (BoW) approach is known from natural language processing [30], which can be referred to the early description in [12]. Particularly, in the application of speech emotion recognition [21,26], and the area of health care [13,25], the BoAW approach achieved numerous excellent results. Motivated by the success of the BoAW method in aforementioned studies, we propose the bag-of-behaviour-words (BoBW) approach.…”

Section: Bag-of-behaviour-words Approach (mentioning)

confidence: 99%

Automatic Detection of Major Depressive Disorder via a Bag-of-Behaviour-Words Approach

Qian

Kuromiya

Ren

et al. 2019

Proceedings of the Third International Symposium on Image Computing and Digital Medicine

Self Cite


In recent years, machine learning has been increasingly applied to the area of mental health diagnosis, treatment, support, research, and clinical administration. In particular, there is a strong practical need for less-invasive wearables combined with artificial intelligence to monitor or diagnose mental diseases. To this end, we propose a novel approach for automatic detection of major depressive disorder. Firstly, spontaneous physical activity data are recorded by a watch-type device equipped with an activity monitor. Subsequently, a bag-of-behaviour-words approach is applied to extract higher representations from the raw sensor data in an unsupervised scenario. Finally, a support vector machine is selected as the classifier to make the predictions on screening the major depressive disorder. There are 69 healthy control subjects and 14 major depressive disorder patients involved in this study. The experimental results demonstrate the effectiveness of the proposed method in a rigorous subject-independent test, which achieves an unweighted average recall of 59.3 % (an accuracy of 66.0 %). This unweighted average recall significantly (p < .05, one-tailed test) outperforms human hand-crafted features with an unweighted average recall of 53.6 % (an accuracy of 61.7 %).
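The abstract above reports unweighted average recall (UAR) alongside accuracy. With 69 controls and 14 patients the classes are strongly imbalanced, which is presumably why UAR is the headline metric: it averages the recall of each class, so a classifier cannot score well by favouring the majority class. The following is a minimal sketch of the two metrics, not the authors' evaluation code; the function names and the all-"control" predictions are hypothetical and chosen only to mirror the 69/14 split.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy(y_true, y_pred):
    """Fraction of all samples that were labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels mirroring the 69/14 class split from the abstract.
y_true = ["control"] * 69 + ["mdd"] * 14
# Invented predictions: a trivial classifier that always answers "control".
y_pred = ["control"] * 83

uar = unweighted_average_recall(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}, UAR = {uar:.3f}")
```

In this invented case the majority-class classifier reaches an accuracy of 69/83 while its UAR stays at chance level for two classes (0.5), which illustrates why the two figures can diverge on imbalanced data.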


“…Emotion recognition from audiovisual signals usually relies on feature sets whose extraction is based on expertise gained over several decades of research in the domains of speech processing, e. g., Mel Frequency Cepstral Coefficients (MFCCs), and vision computing, e. g., Facial Action Units (FAUs). However, recent advances in the field of representation learning, whose objective is to learn representations of data that are best suited for the recognition task [6], have shown that efficient representations of audiovisual signals can be learnt in the context of emotion [2,59,71].…”

Section: Baseline Features (mentioning)

confidence: 99%

AVEC 2018 Workshop and Challenge

Ringeval

Schuller

Valstar

et al. 2018

Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop

Self Cite

120


The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind *


“…Further, we will look into the data imbalance effects of the database and how this could possibly improve robustness. Moreover, we will combine LSTM and GRU networks on the recently proposed Bag-of-Audio-Words approach [30]. Finally, we also plan to do a full end-to-end training of the combined feature and posterior models and examine other network architectures, such as variants of the LSTM models or Convolutional Neural Networks.…”

Section: Discussion (mentioning)

confidence: 99%

Spotting Social Signals in Conversational Speech over IP: A Deep Learning Perspective

et al. 2017

Self Cite


The automatic detection and classification of social signals is an important task, given the fundamental role nonverbal behavioral cues play in human communication. We present the first cross-lingual study on the detection of laughter and fillers in conversational and spontaneous speech collected 'in the wild' over IP (internet protocol). Further, this is the first comparison of LSTM and GRU networks to shed light on their performance differences. We report frame-based results in terms of the unweighted-average area-under-the-curve (UAAUC) measure and will shortly discuss its suitability for this task. In the mono-lingual setup our best deep BLSTM system achieves 87.0 % and 86.3 % UAAUC for English and German, respectively. Interestingly, the cross-lingual results are only slightly lower, yielding 83.7 % for a system trained on English, but tested on German, and 85.0 % in the opposite case. We show that LSTM and GRU architectures are valid alternatives for e. g., on-line and compute-sensitive applications, since their application incurs a relative UAAUC decrease of only approximately 5% with respect to our best systems. Finally, we apply additional smoothing to correct for erroneous spikes and drops in the posterior trajectories to obtain an additional gain in all setups.
