2015
DOI: 10.1016/j.neunet.2015.09.009

Multimodal emotional state recognition using sequence-dependent deep hierarchical features

Abstract: Emotional state recognition has become an important topic for human-robot interaction in the past years. By determining emotion expressions, robots can identify important variables of human behavior and use these to communicate in a more human-like fashion and thereby extend the interaction possibilities. Human emotions are multimodal and spontaneous, which makes them hard to be recognized by robots. Each modality has its own restrictions and constraints which, together with the non-structured behavior of spon…

Cited by 66 publications
(30 citation statements)
References 31 publications
“…When using 3D CNN for spatio-temporal modeling of image sequences as discussed in Section 3.1.1, the line between spatial and temporal representation learning can be blurred. While this approach is typically limited to very short sequences, with further pooling steps necessary to derive sequence-level labels (e.g., [84], [85]), in some cases spatio-temporal features can be derived for entire (short) sequences. For example, Gupta et al. [62] used a variant called slow fusion [153], which treats the time domain like a spatial domain, progressively learning low-level to high-level temporal features.…”
Section: Learning Temporal Features For FER (mentioning)
confidence: 99%
“…To be able to deal with multimodal data, our network uses the concept of the CCCNN by Barros, Jirak, Weber, and Wermter (2015a). In the CCCNN architecture, several channels, each one of them composed of an independent sequence of convolution and pooling layers, are fully connected at the end to a cross-channel layer, which is composed of convolution and pooling layers, and trained as one single architecture.…”
Section: Emotion Expression Representation (mentioning)
confidence: 99%
“…Chen et al [129] used HOG on the motion history image (MHI) for finding the direction and speed, and Image-HOG features from bag of words (BOW) to compute appearance features. Another example is the usage of a multichannel CNN for learning a deep representation from the upper part of the body [130]. Finally, Botzheim et al [131] used spiking neural networks for temporal coding.…”
Section: Representation Learning (mentioning)
confidence: 99%
“…2) for learning such representations in a supervised way. Two of the very few works that uses deep learning representations for body emotion recognition are multichannel CNN from upper body [130] and spiking neural networks for temporal coding [131]. As previously discussed in Sec.…”
Section: Representation Learning and Emotion Recognition (mentioning)
confidence: 99%