2014
DOI: 10.1017/atsip.2014.11
Survey on audiovisual emotion recognition: databases, features, and data fusion strategies

Abstract: Emotion recognition is the ability to identify what someone is feeling from moment to moment and to understand the connection between his/her feelings and expressions. In today's world, the human-computer interaction (HCI) interface undoubtedly plays an important role in our daily life. Toward a harmonious HCI interface, automated analysis and recognition of human emotion have attracted increasing attention from researchers in multidisciplinary research fields. In this paper, a survey on the theor…

Cited by 141 publications (86 citation statements)
References 100 publications
“…In [6], C. H. Wu et al. presented a survey on theoretical and practical work, offering new and broad views of the latest research in emotion recognition from bimodal information, including facial and vocal expressions.…”
Section: Literature Survey
confidence: 99%
“…What was of interest to us was the feature fusion. In general, the fusion methods used in multimodal continuous dimensional emotion recognition can be divided into feature-level, decision-level, and model-level fusion, and mixed approaches [1], [7].…”
Section: Introduction
confidence: 99%
“…For feature-level fusion, the information from multiple modalities is combined to generate the recognition feature [1], [7]. The simplest method is to construct a joint feature, used as the input of a regression model, by concatenating the features from all modalities [1], [7], [11], [16]–[19]. Additionally, many other feature-level fusion strategies have been proposed.…”
Section: Introduction
confidence: 99%
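The concatenation-based feature-level fusion described in the statement above can be sketched as follows. This is a minimal illustration, not code from any of the cited papers; the feature vectors and their dimensions are invented for the example.

```python
import numpy as np

def early_fusion(audio_feat: np.ndarray, visual_feat: np.ndarray) -> np.ndarray:
    """Feature-level (early) fusion: concatenate per-modality feature
    vectors into one joint vector, which would then be fed to a
    regression model for continuous emotion prediction."""
    return np.concatenate([audio_feat, visual_feat])

# Toy per-modality features (dimensions are illustrative)
audio = np.array([0.2, 0.5, 0.1])   # e.g. prosodic/spectral features
visual = np.array([0.7, 0.3])       # e.g. facial expression features

joint = early_fusion(audio, visual)  # joint vector of length 3 + 2 = 5
```

A regressor trained on `joint` sees all modalities at once, which is the simplicity (and the curse-of-dimensionality risk) that motivates the other fusion strategies the quote mentions.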
“…In the multi-modal fusion domain, many approaches have attempted to jointly learn temporal features from multiple modalities (Wu et al, 2014a), such as feature-level (early) fusion (Ngiam et al, 2011; Ramanishka et al, 2016), decision-level (late) fusion (He et al, 2015), model-level fusion (Wu et al, 2014b), and attention fusion (Chen …”
Section: Introduction
confidence: 99%