Proceedings of the 14th ACM International Conference on Multimodal Interaction 2012
DOI: 10.1145/2388676.2388780
Step-wise emotion recognition using concatenated-HMM

Abstract: Human emotion is an important part of human-human communication, since the emotional state of an individual often affects the way that he/she reacts to others. In this paper, we present a method based on concatenated Hidden Markov Model (co-HMM) to infer the dimensional and continuous emotion labels from audio-visual cues. Our method is based on the assumption that continuous emotion levels can be modeled by a set of discrete values. Based on this, we represent each emotional dimension by step-wise label class…
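As a rough illustration of the step-wise assumption described in the abstract, the sketch below quantizes a continuous emotion trace (e.g., valence in [-1, 1]) into a small set of discrete levels and decodes a smoothed level sequence with a hand-rolled Gaussian-emission HMM (Viterbi). The number of levels, the sticky transition matrix, and the emission variance are illustrative assumptions, not the authors' configuration.

```python
# Illustrative sketch (not the authors' code): discretize a continuous emotion
# label into step-wise levels, then decode a smoothed level sequence with a
# simple Gaussian-emission HMM via Viterbi. All parameters are assumed values.
import numpy as np

def quantize(labels, n_levels=5, lo=-1.0, hi=1.0):
    """Map continuous labels in [lo, hi] to discrete level indices 0..n_levels-1."""
    edges = np.linspace(lo, hi, n_levels + 1)
    return np.clip(np.digitize(labels, edges[1:-1]), 0, n_levels - 1)

def viterbi_decode(obs, level_centers, self_prob=0.9, sigma=0.3):
    """Most likely step-wise level sequence for a 1-D observation trace."""
    n_states = len(level_centers)
    # Sticky transitions favour staying in the current level (step-wise output).
    trans = np.full((n_states, n_states), (1.0 - self_prob) / (n_states - 1))
    np.fill_diagonal(trans, self_prob)
    log_trans = np.log(trans)
    # Gaussian log-likelihood of each observation under each level's centre.
    diff = obs[:, None] - level_centers[None, :]
    log_emit = -0.5 * (diff / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    T = len(obs)
    delta = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(1.0 / n_states) + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # [prev, cur]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy usage: a noisy continuous valence trace mapped back to step-wise levels.
rng = np.random.default_rng(0)
true_valence = np.concatenate([np.full(50, -0.5), np.full(50, 0.4)])
noisy = true_valence + 0.15 * rng.standard_normal(true_valence.shape)
gt_levels = quantize(true_valence)            # step-wise ground-truth classes
centers = np.linspace(-0.8, 0.8, 5)           # centres of 5 assumed levels
levels = viterbi_decode(noisy, centers)       # smoothed level estimate per frame
reconstructed = centers[levels]               # step-wise continuous estimate
```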

Cited by 23 publications (14 citation statements)
References 28 publications
“…Note that the results from the top four contenders are based on audio-visual input features, with additional information such as context [41] (i.e., knowing the personality of a virtual human that the user is interacting with) or the duration of conversation [30] (i.e., knowing how long the user has been interacting with a virtual human). Considering our approach is purely vision-based, this is a significant improvement over other approaches.…”
Section: Results On Visual Input Alone (mentioning)
confidence: 99%
“…Table 1 shows the cross-correlation coefficients between predicted and ground-truth labels, averaged over all sequences. We include the baseline result from [37] and the results of the top four contenders from AVEC 2012 [27,41,36,30]. We also include our results on audio-visual input (bottom row), discussed in Section 4.4.…”
Section: Methods (mentioning)
confidence: 99%
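The evaluation referenced in the statement above, the cross-correlation coefficient between predicted and ground-truth labels averaged over all sequences, can be sketched as a per-sequence Pearson correlation followed by an unweighted mean. Whether AVEC 2012 weighted sequences equally or by length is not assumed here; this is only a plausible reading of the metric.

```python
# Rough sketch of the metric referenced above: Pearson correlation between
# predicted and ground-truth traces, computed per sequence, then averaged.
import numpy as np

def pearson(pred, truth):
    """Pearson correlation between two equal-length 1-D traces."""
    pred = np.asarray(pred, dtype=float) - np.mean(pred)
    truth = np.asarray(truth, dtype=float) - np.mean(truth)
    denom = np.sqrt((pred ** 2).sum() * (truth ** 2).sum())
    return float((pred * truth).sum() / denom) if denom > 0 else 0.0

def mean_cross_correlation(pred_sequences, truth_sequences):
    """Unweighted average of per-sequence correlations (assumed averaging)."""
    return float(np.mean([pearson(p, t)
                          for p, t in zip(pred_sequences, truth_sequences)]))
```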
“…The baseline audio-visual system trained on the union of word-level high-dimensional super-segmental acoustic features and LBP image features obtains a correlation of 0.015 (Schuller et al., 2012), which is much lower than our scores of 0.152 and 0.214 obtained on the best acoustic and lexical representations respectively. Ozkan et al. (2012) applied co-HMM fusion with low-dimensional video (smile, gaze, head tilt), acoustic (energy, articulation rate, F0, Peak slope, Spectral stationarity), and scale time features. Their best performance on word-level prediction is 0.200, which is second place in the word-level competition.…”
Section: Discussion (mentioning)
confidence: 99%
“…For AVEC 2011: UCL (Meng and Bianchi-Berthouze 2011), Uni-ULM (Glodek et al 2011), GaTechKim (Kim et al 2011), LSU (Calix et al 2011), Waterloo (Sayedelahl et al 2011), NLPR (Pan et al 2011), USC (Ramirez et al 2011), GaTechSun (Sun and Moore 2011), I2R-SCUT (Cen et al 2011), UCR (Cruz et al 2011) and UMontreal (Dahmane and Meunier 2011a, b). For AVEC 2012: UPMC-UAG (Nicolle et al 2012), Supelec-Dynamixyz-MinesTelecom (Soladie et al 2012), UPenn (Savran et al 2012a), USC (Ozkan et al 2012), Delft (van der Maaten 2012), Uni-ULM (Glodek et al 2012), Waterloo2 (Fewzee and Karray 2012). The results obtained by I2R, Cubic-ASU, and the University of Aberystwyth did not result in a publication.…”
Section: Audio/visual Emotion Challenge 2011/2012 (mentioning)
confidence: 96%