2024
DOI: 10.1007/s11042-023-17944-9

Survey of deep emotion recognition in dynamic data using facial, speech and textual cues

Tao Zhang, Zhenhua Tan

Cited by 1 publication (2 citation statements)
References 165 publications
“…In summary, encoder A consists of computing the log-Mel spectrograms (Equation (8)), reshaping and normalizing the spectrograms (Equation (9)), and encoding emotion using text processing models (Equation (10) or Equation (11)); encoder B fine-tunes pre-trained models (Equation (12) or Equation (13)), and the dual-stream outputs are fused so that the framework is trained towards speech emotion prediction (Equation (14)). The output of the CAF module is defined in Equation (20).…”
Section: Dual-stream Representation of Audio Signals
confidence: 99%
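The quoted passage outlines a two-branch pipeline. Below is a minimal sketch of that general dual-stream pattern, assuming torchaudio for the log-Mel front end, a small Transformer standing in for encoder A's text-processing model, and a wav2vec2 backbone standing in for encoder B's pre-trained model; the layer sizes, pooling, and concatenation fusion are placeholders and do not reproduce the cited paper's Equations (8)-(20) or its CAF module.

```python
# Minimal sketch of the dual-stream pattern summarized above (assumptions,
# not the cited paper's method): torchaudio supplies the log-Mel front end,
# a small Transformer stands in for encoder A's text-processing model, and
# a wav2vec2 backbone stands in for encoder B's pre-trained model.
import torch
import torch.nn as nn
import torchaudio

class DualStreamSER(nn.Module):
    def __init__(self, n_mels=64, d_model=256, n_classes=4):
        super().__init__()
        # Encoder A, step 1: compute log-Mel spectrograms (cf. Eq. (8)).
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Encoder A, step 3: sequence encoding of the normalized frames
        # (a stand-in for the text processing models of Eq. (10)/(11)).
        self.proj_a = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder_a = nn.TransformerEncoder(layer, num_layers=2)
        # Encoder B: a pre-trained speech backbone to fine-tune
        # (cf. Eq. (12)/(13)).
        self.encoder_b = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        # Fusion + prediction head (cf. Eq. (14)); plain concatenation here,
        # not the paper's CAF module (Eq. (20)).
        self.classifier = nn.Linear(d_model + 768, n_classes)

    def forward(self, wav):                      # wav: (batch, samples) @ 16 kHz
        # Encoder A, step 2: reshape and normalize the spectrogram (cf. Eq. (9)).
        spec = self.to_db(self.melspec(wav))     # (batch, n_mels, frames)
        spec = (spec - spec.mean()) / (spec.std() + 1e-5)
        a = self.encoder_a(self.proj_a(spec.transpose(1, 2))).mean(dim=1)
        feats, _ = self.encoder_b.extract_features(wav)
        b = feats[-1].mean(dim=1)                # pool last-layer features
        return self.classifier(torch.cat([a, b], dim=-1))
```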
“…Many multi-modal frameworks [10] have been developed for automated emotion recognition, since multi-modal representation offers the potential for a thorough and nuanced understanding of emotional states. Liu et al. explore peripheral physiological signals, EEG, and facial videos, proposing emotion dictionary learning with modality attention for mixed emotion recognition [11].…”
Section: Introduction
confidence: 99%
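As a generic illustration of the modality-attention idea mentioned in the quoted statement, the sketch below learns a scalar attention score per modality embedding and fuses the streams by a weighted sum. This is a common fusion pattern, not Liu et al.'s emotion dictionary learning method [11]; the class name and dimensions are hypothetical.

```python
# Generic attention-weighted fusion over per-modality embeddings
# (e.g., physiological signals, EEG, facial video). Hypothetical names
# and sizes; not the method of [11].
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # One scalar attention score per modality, from its embedding.
        self.score = nn.Linear(dim, 1)

    def forward(self, embeddings):
        # embeddings: (batch, n_modalities, dim), one row per modality.
        weights = torch.softmax(self.score(embeddings), dim=1)  # (B, M, 1)
        return (weights * embeddings).sum(dim=1)                # (B, dim)

# Usage: fuse three 128-d modality embeddings for a batch of 8 samples.
fusion = ModalityAttentionFusion(dim=128)
fused = fusion(torch.randn(8, 3, 128))
print(fused.shape)  # torch.Size([8, 128])
```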