Audio-Visual Affect Recognition

Zeng, Zhihong; Tu, Jilin; Liu, M.; Huang, Thomas S.; Pianfetti, B.; Roth, Dan; Levinson, Stephen E.

doi:10.1109/tmm.2006.886310

Cited by 122 publications

(47 citation statements)

References 16 publications

(13 reference statements)

“…Facial expressions [20], [21], [22], vocal features [23] [24] [25], body movements and postures [26], [27], [11], [28], physiological signals [29] have been used as inputs during these attempts, although multimodal emotion recognition is currently gaining ground [7], [30], [31], [32], [33]. Nevertheless, most of the work has considered the integration of information from facial expressions and speech [34], [35] and there have been relatively few attempts to combine information from body movement and gestures in a multimodal framework. Gunes and Piccardi [8], for example, fused facial expressions and body gestures at different levels for bimodal emotion recognition.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis

Kessous

Castellano

2009

J Multimodal User Interfaces

195

In this paper a study on multimodal automatic emotion recognition during a speech-based interaction is presented. A database was constructed consisting of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten people pronounced a sentence corresponding to a command while making 8 different emotional expressions. Gender was equally represented, with speakers of several different native languages including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. For the automatic classification of unimodal data, bimodal data and multimodal data, a system based on a Bayesian classifier was used. After performing an automatic classification of each modality, the different modalities were combined using a multimodal approach. Fusion of the modalities at the feature level (before running the classifier) and at the results level (combining results from classifier from each modality) were compared. Fusing the multimodal data resulted in a large increase in the recognition rates in comparison to the unimodal systems: the multimodal approach increased the recognition rate by more than 10% when compared to the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., 'face-gesture', 'face-speech' and 'gesture-speech') was also investigated. The results show that the best pairing is 'gesture-speech'. Using all three modalities resulted in a 3.3% classification improvement over the best bimodal results.
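The two fusion strategies contrasted in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the feature vectors, class names and per-modality posterior values are made up for illustration, and the product rule is only one common way of combining classifier outputs, assumed here rather than taken from the paper.

```python
from math import prod


def feature_level_fusion(*feature_vectors):
    """Concatenate per-modality feature vectors into one vector
    that a single classifier would then consume."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def decision_level_fusion(*posteriors):
    """Combine per-modality class posteriors with the product rule
    and renormalise so the fused scores sum to 1."""
    classes = posteriors[0].keys()
    scores = {c: prod(p[c] for p in posteriors) for c in classes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical inputs (illustrative values only, not data from the paper).
face_features = [0.2, 0.7]
gesture_features = [1.5]
speech_features = [0.1, 0.9, 0.4]

# Feature level: fuse first, classify once.
fused_vector = feature_level_fusion(face_features, gesture_features, speech_features)

# Results level: classify each modality separately, then fuse the outputs.
face_post = {"anger": 0.6, "joy": 0.4}
gesture_post = {"anger": 0.3, "joy": 0.7}
speech_post = {"anger": 0.2, "joy": 0.8}

fused_post = decision_level_fusion(face_post, gesture_post, speech_post)
predicted = max(fused_post, key=fused_post.get)
```

In this toy case the face classifier alone would pick "anger", while the fused posterior picks "joy" because the other two modalities agree with each other.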

“…In these datasets, interactions include two interlocutors, who are recorded carrying out both structured and unstructured conversations. There are numerous examples of such datasets, with a wide range of applications, such as speech recognition [15], behavior analysis [50], segmentation, emotion recognition [12] and depression detection [16]. Arguably, one of the most popular datasets of one-to-one interactions is SEMAINE [30].…”

Section: Related Work (mentioning)

confidence: 99%

The NoXi database: multimodal recordings of mediated novice-expert interactions

Cafaro

Wagner

Baur

et al. 2017

Proceedings of the 19th ACM International Conference on Multimodal Interaction

We present a novel multi-lingual database of natural dyadic novice-expert interactions, named NoXi, featuring screen-mediated dyadic human interactions in the context of information exchange and retrieval. NoXi is designed to provide spontaneous interactions with emphasis on adaptive behaviors and unexpected situations (e.g. conversational interruptions). A rich set of audio-visual data, as well as continuous and discrete annotations are publicly available through a web interface. Descriptors include low level social signals (e.g. gestures, smiles), functional descriptors (e.g. turn-taking, dialogue acts) and interaction descriptors (e.g. engagement, interest, and fluidity).

CCS Concepts: Information systems → Database design and models; Semistructured data; Data streams; Human-centered computing → Systems and tools for interaction design.

Keywords: Affective computing, multimodal corpora, multimedia databases.

“…A number of studies favor decision-level fusion as the preferred method of data fusion because errors from different classifiers tend to be uncorrelated and the methodology is feature-independent [66]. Bimodal fusion methods have been proposed in numerous instances [12,67,68], but optimal information fusion configurations remain elusive.…”
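The appeal to uncorrelated errors can be made concrete with a small calculation. The sketch below assumes that the classifiers err fully independently and share the same accuracy, which real modalities only approximate; the function name and the 0.7 figure are illustrative and do not come from the cited work. For a multi-class problem the result is a lower bound on plurality-vote accuracy, since the vote can also be right without a strict majority.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, is correct (n should be odd)."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Three hypothetical classifiers, each 70% accurate on its own.
acc3 = majority_vote_accuracy(0.7, 3)
```

Under these assumptions three 70%-accurate classifiers reach 0.784 by majority vote. The gain shrinks as their errors become correlated.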

Section: Multimodal Fusion (mentioning)

confidence: 99%

Towards an intelligent framework for multimodal affective data analysis

et al. 2015

An increasingly large amount of multimodal content is posted on social media websites such as YouTube and Facebook every day. In order to cope with the growth of so much multimodal data, there is an urgent need to develop an intelligent multi-modal analysis framework that can effectively extract information from multiple modalities. In this paper, we propose a novel multimodal information extraction agent, which infers and aggregates the semantic and affective information associated with user-generated multimodal data in contexts such as e-learning, e-health, automatic video content tagging and human-computer interaction. In particular, the developed intelligent agent adopts an ensemble feature extraction approach by exploiting the joint use of tri-modal (text, audio and video) features to enhance the multimodal information extraction process. In preliminary experiments using the eNTERFACE dataset, our proposed multi-modal system is shown to achieve an accuracy of 87.95%, outperforming the best state-of-the-art system by more than 10%, or in relative terms, a 56% reduction in error rate.
