Interspeech 2017
DOI: 10.21437/interspeech.2017-548

Capturing Long-Term Temporal Dependencies with Convolutional Networks for Continuous Emotion Recognition

Abstract: The goal of continuous emotion recognition is to assign an emotion value to every frame in a sequence of acoustic features. We show that incorporating long-term temporal dependencies is critical for continuous emotion recognition tasks. To this end, we first investigate architectures that use dilated convolutions. We show that even though such architectures outperform previously reported systems, the output signals produced from such architectures undergo erratic changes between consecutive time steps. This is…
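The dilated-convolution idea from the abstract can be illustrated with a minimal numpy sketch. This is illustrative only: the kernel size (3), the dilation schedule (1, 2, 4, 8), and the averaging weights are assumptions for the example, not the paper's architecture.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D dilated convolution over the valid region of signal x."""
    k = len(w)
    span = (k - 1) * dilation          # input span covered by one output sample
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        taps = x[t : t + span + 1 : dilation]  # taps spaced `dilation` apart
        out[t] = np.dot(taps, w)
    return out

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive
# field exponentially with depth: 1 + (3-1)*(1+2+4+8) = 31 samples.
x = np.random.randn(100)
w = np.ones(3) / 3.0                   # kernel size 3, simple averaging weights
y = x
for d in [1, 2, 4, 8]:
    y = dilated_conv1d(y, w, d)
print(len(y))  # 70: each layer shortens the signal by (3-1)*dilation
```

Each doubling of the dilation lets a fixed-size kernel see exponentially more context, which is how such stacks capture long-term dependencies without deep pooling.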


Cited by 30 publications (34 citation statements)
References 20 publications (36 reference statements)
“…Prior work has demonstrated the importance of considering long-term context when predicting valence (the same effect has not been shown in activation) [30]. The contextual annotations provided the annotators with this information, but the classifier could not take advantage of this effect.…”
Section: Question
confidence: 97%
“…For this reason, we select hyperparameters based on those found to be commonly selected in prior work and keep them constant for all experiments. A channel size of 128 is used for all convolutional and fully connected layers, as commonly selected in prior work [24], [52]. ReLU is used as the activation function for all but the final layer, as it has been shown to be successful in the field and is computationally efficient [24], [53].…”
Section: CNN
confidence: 99%
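The design choices quoted above (128 channels everywhere, ReLU on every layer except the last) can be sketched with a toy numpy forward pass. A dense stack stands in for the convolutional layers here, and the weight initialization and depth are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
channels = 128  # channel size from the cited hyperparameter choice
# three hidden layers plus a final linear regression layer (hypothetical depth)
layers = [rng.standard_normal((channels, channels)) * 0.01 for _ in range(3)]
w_out = rng.standard_normal((channels, 1)) * 0.01

def forward(x):
    h = x
    for w in layers:
        h = relu(h @ w)   # ReLU on every hidden layer
    return h @ w_out      # final layer stays linear for regression output

x = rng.standard_normal((1, channels))
print(forward(x).shape)  # (1, 1): one continuous emotion value per input
```

Leaving the final layer linear matters for continuous emotion recognition: a ReLU output could never predict negative valence or activation values.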
“…Stress has been shown to have varying effects on both the linguistic [5] and para-linguistic [37,41] components of communication. Previous work has also demonstrated that the lexical part of speech carries more information about valence, while the para-linguistic part carries more information about activation [22]. As a result, we expect the performance of stress classification to vary based on the modality and the emotion dimension being modeled.…”
Section: Question
confidence: 90%
“…Acoustic. We use Mel Filterbank (MFB) features, which are frequently used in speech processing applications, including speech recognition and emotion recognition [22,26]. We extract the 40-dimensional MFB features using a 25-millisecond Hamming window with a step-size of 10 milliseconds.…”
Section: Features
confidence: 99%
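The framing step quoted above (25 ms Hamming window, 10 ms step) can be sketched in numpy. The 16 kHz sample rate is an assumption for illustration; the 40-dimensional MFB features would then come from applying a mel filterbank to each frame's power spectrum, which is omitted here:

```python
import numpy as np

sr = 16000                # assumed sample rate (not stated in the excerpt)
win = int(0.025 * sr)     # 25 ms window -> 400 samples
hop = int(0.010 * sr)     # 10 ms step  -> 160 samples
hamming = np.hamming(win)

def frame_signal(x, win, hop):
    """Slice a 1-D signal into overlapping frames, dropping the tail."""
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

x = np.random.randn(sr)                     # 1 second of audio
frames = frame_signal(x, win, hop) * hamming  # windowed frames
print(frames.shape)  # (98, 400)
```

With a 10 ms hop, each second of audio yields roughly 100 frames, which is the per-frame rate at which continuous emotion values are predicted.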