Proceedings of the 19th ACM International Conference on Multimodal Interaction 2017
DOI: 10.1145/3136755.3143012
Audio-visual emotion recognition using deep transfer learning and multiple temporal models

Cited by 81 publications (52 citation statements)
References 21 publications
“…However, transfer learning methods play a very limited role in this process. Common knowledge-transfer strategies in multi-modal methods include fine-tuning a well-trained model on a specific type of signal (Vielzeuf et al., 2017; Yan et al., 2018; Huang et al., 2019; Ortega et al., 2019), or fine-tuning different well-trained models on both speech and video signals (Ouyang et al., 2017; Zhang et al., 2017; Ma et al., 2019). Another use of transfer learning in multi-modal methods is leveraging knowledge from one signal to another (e.g., video to speech) to reduce potential bias (Athanasiadis et al., 2019).…”
Section: Multi-modal Transfer Learning For Emotion Recognition (mentioning)
confidence: 99%
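The fine-tuning strategy described in that statement can be made concrete with a minimal sketch: take a model pre-trained on a large generic dataset and retrain only a new classification head on emotion labels. The backbone choice (ImageNet ResNet-18 via torchvision), the class count, and the frozen-backbone setup below are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal fine-tuning sketch, assuming a generic ImageNet-pretrained backbone.
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # assumption: e.g., seven basic emotion classes

backbone = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained, well-trained model
for p in backbone.parameters():
    p.requires_grad = False                          # freeze transferred weights

# Replace the final layer with a new head for the target signal's labels.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_EMOTIONS)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of face crops.
faces = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_EMOTIONS, (8,))
loss = criterion(backbone(faces), labels)
loss.backward()
optimizer.step()
```

The same pattern extends to the two-model variant mentioned above: one pre-trained network is fine-tuned on speech features and a second on video frames, with their predictions or features combined downstream.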
“…Compared with RNNs, CNNs are more suitable for computer vision applications; hence, the CNN derivative C3D [107], which uses 3D convolutional kernels with weights shared along the time axis instead of the traditional 2D kernels, has been widely used for dynamic-based FER (e.g., [83], [108], [189], [197], [198]) to capture spatio-temporal features. Based on C3D, many derived structures have been designed for FER.…”
Section: RNN and C3D (mentioning)
confidence: 99%
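A short sketch clarifies the 3D-convolution idea behind C3D: a Conv3d kernel slides over (time, height, width), so the same weights are shared along the time axis and each output mixes spatial and temporal information. The channel counts, clip length, and pooling below are illustrative assumptions, not the exact C3D architecture.

```python
# One spatio-temporal conv block, assuming a 16-frame RGB clip at 112x112.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)  # 3x3x3 kernel over time and space
pool = nn.MaxPool3d(kernel_size=(1, 2, 2))            # pool spatially, keep time axis

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, frames, H, W)
features = pool(torch.relu(conv3d(clip)))
print(features.shape)  # torch.Size([1, 64, 16, 56, 56])
```

Because the kernel spans three frames, the feature map already encodes short-range motion, which is what makes this family of models attractive for dynamic facial expression recognition.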
“…The study [8] identifies 27 distinct categories of human emotion, but for music videos it is convenient to organize them into coarse semantic groups so that an end user can easily retrieve the desired music video from large video banks or online music video stores. Following [41,52,67], we group the adjectives used for music video emotion classification into six basic emotion categories: Exciting, Fear, Neutral, Relaxation, Sad, and Tension. Three samples from each emotion class are shown (from left to right) in Fig.…”
Section: Music Video Emotion Dataset (mentioning)
confidence: 99%
“…An extension of facial emotion analysis is proposed in [69], using an audio spectrogram and a human face image in an integrated multimodal architecture. The multimodal approaches in [11,13,41,44] combine audio and video, using a recurrent network with LSTM cells for emotion recognition from face video. The multimodal model in [61], built from a one-dimensional (1D) audio network and a 2D video network for speech recognition, uses hybrid information fusion, adding a recurrent neural network after the concatenation of learned features.…”
Section: Introduction (mentioning)
confidence: 99%
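The hybrid fusion pattern in that last statement (per-modality feature extractors whose learned features are concatenated and then passed through a recurrent network) can be sketched as below. All layer sizes, the 7-class head, and the dummy input shapes are illustrative assumptions, not taken from the cited papers.

```python
# Concatenation-then-RNN fusion sketch: 1D conv audio branch, 2D conv video branch.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, num_classes=7):  # assumption: seven emotion classes
        super().__init__()
        self.audio_net = nn.Sequential(           # 1D conv over waveform chunks
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.video_net = nn.Sequential(           # 2D conv over each frame
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(64, 64, batch_first=True)  # RNN after concatenation
        self.head = nn.Linear(64, num_classes)

    def forward(self, audio, video):
        # audio: (B, T, 1, samples); video: (B, T, 3, H, W)
        B, T = audio.shape[:2]
        a = self.audio_net(audio.flatten(0, 1)).view(B, T, -1)  # per-step audio features
        v = self.video_net(video.flatten(0, 1)).view(B, T, -1)  # per-step video features
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))         # concatenate, then LSTM
        return self.head(fused[:, -1])                          # classify last time step

model = AudioVisualFusion()
logits = model(torch.randn(2, 8, 1, 1600), torch.randn(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])
```

The design choice mirrored here is that fusion happens at the feature level per time step, leaving the temporal modeling to the recurrent layer rather than to either modality's extractor.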