Proceedings of the 19th ACM International Conference on Multimodal Interaction 2017
DOI: 10.1145/3136755.3143012
Audio-visual emotion recognition using deep transfer learning and multiple temporal models

Cited by 81 publications (52 citation statements)
References 21 publications
“…However, transfer learning methods play a very limited role in this process. Common knowledge-transfer strategies in multi-modal methods include fine-tuning a well-trained model on a specific type of signal (Vielzeuf et al., 2017; Yan et al., 2018; Huang et al., 2019; Ortega et al., 2019), or fine-tuning different well-trained models on both speech and video signals (Ouyang et al., 2017; Zhang et al., 2017; Ma et al., 2019). Another use of transfer learning in multi-modal methods is leveraging knowledge from one signal to another (e.g., video to speech) to reduce potential bias (Athanasiadis et al., 2019).…”
Section: Multi-modal Transfer Learning For Emotion Recognition (mentioning)
confidence: 99%
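The fine-tuning strategy described in that statement can be made concrete with a minimal sketch: take a model pre-trained on a large generic dataset and retrain only a new classification head on emotion labels. The backbone choice (ImageNet ResNet-18 via torchvision), the class count, and the frozen-backbone setup below are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal fine-tuning sketch, assuming a generic ImageNet-pretrained backbone.
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # assumption: e.g., seven basic emotion classes

backbone = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained, well-trained model
for p in backbone.parameters():
    p.requires_grad = False                          # freeze transferred weights

# Replace the final layer with a new head for the target signal's labels.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_EMOTIONS)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of face crops.
faces = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_EMOTIONS, (8,))
loss = criterion(backbone(faces), labels)
loss.backward()
optimizer.step()
```

The same pattern extends to the two-model variant mentioned above: one pre-trained network is fine-tuned on speech features and a second on video frames, with their predictions or features combined downstream.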
“…Compared with RNNs, CNNs are more suitable for computer vision applications; hence, the CNN derivative C3D [107], which uses 3D convolutional kernels with weights shared along the time axis instead of the traditional 2D kernels, has been widely used for dynamic-based FER (e.g., [83], [108], [189], [197], [198]) to capture spatio-temporal features. Based on C3D, many derived structures have been designed for FER.…”
Section: RNN and C3D (mentioning)
confidence: 99%
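A short sketch clarifies the 3D-convolution idea behind C3D: a Conv3d kernel slides over (time, height, width), so the same weights are shared along the time axis and each output mixes spatial and temporal information. The channel counts, clip length, and pooling below are illustrative assumptions, not the exact C3D architecture.

```python
# One spatio-temporal conv block, assuming a 16-frame RGB clip at 112x112.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)  # 3x3x3 kernel over time and space
pool = nn.MaxPool3d(kernel_size=(1, 2, 2))            # pool spatially, keep time axis

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, frames, H, W)
features = pool(torch.relu(conv3d(clip)))
print(features.shape)  # torch.Size([1, 64, 16, 56, 56])
```

Because the kernel spans three frames, the feature map already encodes short-range motion, which is what makes this family of models attractive for dynamic facial expression recognition.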
“…The study [8] identifies 27 distinct categories of human emotion, but for music videos it is convenient to organize them into coarse semantic groups so that an end user can easily retrieve the desired music video from large video banks or online music video stores. Following [41,52,67], we group the adjectives used for music video emotion classification into six basic emotion categories: Exciting, Fear, Neutral, Relaxation, Sad, and Tension. Three samples from each emotion class are shown (from left to right) in Fig.…”
Section: Music Video Emotion Dataset (mentioning)
confidence: 99%
“…An extension of facial emotion analysis is proposed in [69], using an audio spectrogram and a human face image in an integrated multimodal architecture. The multimodal approaches in [11,13,41,44] combine audio and video, using a recurrent network with LSTM cells for emotion recognition from face video. The multimodal model in [61], built from a one-dimensional (1D) audio network and a 2D video network for speech recognition, uses hybrid information fusion, adding a recurrent neural network after the concatenation of learned features.…”
Section: Introduction (mentioning)
confidence: 99%
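The hybrid fusion pattern in that last statement (per-modality feature extractors whose learned features are concatenated and then passed through a recurrent network) can be sketched as below. All layer sizes, the 7-class head, and the dummy input shapes are illustrative assumptions, not taken from the cited papers.

```python
# Concatenation-then-RNN fusion sketch: 1D conv audio branch, 2D conv video branch.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, num_classes=7):  # assumption: seven emotion classes
        super().__init__()
        self.audio_net = nn.Sequential(           # 1D conv over waveform chunks
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.video_net = nn.Sequential(           # 2D conv over each frame
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(64, 64, batch_first=True)  # RNN after concatenation
        self.head = nn.Linear(64, num_classes)

    def forward(self, audio, video):
        # audio: (B, T, 1, samples); video: (B, T, 3, H, W)
        B, T = audio.shape[:2]
        a = self.audio_net(audio.flatten(0, 1)).view(B, T, -1)  # per-step audio features
        v = self.video_net(video.flatten(0, 1)).view(B, T, -1)  # per-step video features
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))         # concatenate, then LSTM
        return self.head(fused[:, -1])                          # classify last time step

model = AudioVisualFusion()
logits = model(torch.randn(2, 8, 1, 1600), torch.randn(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])
```

The design choice mirrored here is that fusion happens at the feature level per time step, leaving the temporal modeling to the recurrent layer rather than to either modality's extractor.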