ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054441

Attention Driven Fusion for Multi-Modal Emotion Recognition

Abstract: Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by applying attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a Sin…
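The abstract outlines a pipeline of per-modality encoders followed by attention, fusion, and classification. As a rough illustration of that fusion-and-classification stage, here is a hedged PyTorch sketch that pools an acoustic and a text embedding with learned attention weights before classifying; the dimensions and encoder outputs are placeholders, not the paper's actual architecture.

```python
# Hedged sketch of attention-driven fusion of acoustic and text embeddings
# (modality encoders -> attention -> fusion -> classification). Sizes are
# illustrative placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=128, n_classes=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scores each modality vector
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, acoustic, text):        # each: (batch, dim)
        stack = torch.stack([acoustic, text], dim=1)     # (batch, 2, dim)
        alpha = torch.softmax(self.score(stack), dim=1)  # modality weights
        fused = (alpha * stack).sum(dim=1)               # (batch, dim)
        return self.classifier(fused)

logits = AttentionFusion()(torch.randn(8, 128), torch.randn(8, 128))  # (8, 4)
```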

Cited by 52 publications (41 citation statements)
References 23 publications
“…To our knowledge, the best reported accuracy using textual features only on this dataset was 70.8% [33]. Fine-tuning the off-the-shelf RoBERTa on word transcripts only (without pauses) achieved better performance.…”
Section: Experiments and Results
confidence: 77%
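As an illustration of what fine-tuning an off-the-shelf RoBERTa on word transcripts involves, here is a minimal hedged sketch using the Hugging Face transformers API; the model name, example texts, label indices, and the single training step are assumptions for illustration, not details from the cited work.

```python
# Minimal sketch of fine-tuning RoBERTa on word transcripts (no pause
# tokens) for 4-class emotion recognition. Assumes the Hugging Face
# `transformers` API; all specifics below are illustrative.
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4  # e.g. happy, sad, angry, neutral
)

texts = ["i can't believe we won", "leave me alone"]  # toy transcripts
labels = torch.tensor([0, 2])  # hypothetical label indices

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in real training
```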
“…We used the IEMOCAP dataset [29], a benchmark dataset containing 12 hours of speech from 10 professional actors. Following the literature [30,31,32,33], we extracted 5531 utterances of four emotion types from the dataset: 1636 happy (also including excited), 1084 sad, 1103 angry, and 1708 neutral. The utterances were force-aligned using the P2FA forced aligner.…”
Section: Data
confidence: 99%
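The four-class setup described above, with "excited" folded into "happy", is a common IEMOCAP preprocessing step. A minimal sketch follows, assuming utterances arrive as (id, raw-label) pairs; the label codes and the loader are hypothetical.

```python
# Hedged sketch of the common IEMOCAP filtering step: keep four emotion
# classes and fold "excited" into "happy". Records are assumed to be
# (utterance_id, raw_label) pairs; label codes are hypothetical.
LABEL_MAP = {
    "hap": "happy", "exc": "happy",   # excited merged into happy
    "sad": "sad", "ang": "angry", "neu": "neutral",
}

def filter_utterances(records):
    """Keep only the four mapped classes (5531 utterances on full IEMOCAP)."""
    return [(uid, LABEL_MAP[lab]) for uid, lab in records if lab in LABEL_MAP]

print(filter_utterances([("Ses01F_impro01_F000", "exc"), ("x", "fru")]))
# -> [('Ses01F_impro01_F000', 'happy')]  ("fru" is filtered out)
```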
“…[Priyasad et al.] [64] presented a deep learning-based approach to extract features that are characteristic of emotion. Through a SincNet layer, a band-pass filtering technique combined with a neural net, the researchers managed to extract acoustic features.…”
Section: A. Multimodal Emotion Recognition Combining (Audio…
confidence: 99%
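A SincNet layer implements learnable band-pass filters as windowed sinc kernels applied by 1-D convolution over raw audio. The sketch below is a hedged, simplified rendition of that idea, not the paper's exact configuration; the filter count, kernel size, and cutoff initialization are illustrative.

```python
# Hedged sketch of a SincNet-style layer: each "filter" is a learnable
# band-pass defined by low/high cutoff frequencies, realized as a windowed
# sinc kernel and applied as 1-D convolution over raw audio.
import torch
import torch.nn as nn

class SincConv(nn.Module):
    def __init__(self, n_filters=16, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sr = kernel_size, sample_rate
        # Learnable cutoffs in Hz, initialized to spread over the spectrum.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        n = torch.arange(kernel_size) - kernel_size // 2
        self.register_buffer("t", n / sample_rate)  # kernel time axis (s)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, samples)
        low = self.low_hz.abs()
        high = (low + self.band_hz.abs()).clamp(max=self.sr / 2)
        t = self.t.unsqueeze(0)  # (1, kernel_size)
        # Band-pass impulse response = difference of two low-pass sincs.
        lp_high = 2 * high.unsqueeze(1) * torch.sinc(2 * high.unsqueeze(1) * t)
        lp_low = 2 * low.unsqueeze(1) * torch.sinc(2 * low.unsqueeze(1) * t)
        kernels = (lp_high - lp_low) * self.window   # (n_filters, kernel_size)
        return nn.functional.conv1d(x, kernels.unsqueeze(1))

out = SincConv()(torch.randn(2, 1, 16000))  # -> (2, 16, 15900)
```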
“…There has been recent interest in attention-based SER models for higher accuracy [8,9,12]. However, those attention mechanisms can only be computed at a preset granularity, which may not adapt dynamically to different areas of interest in the spectrogram.…”
Section: Related Work
confidence: 99%
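To make the "preset granularity" point concrete, the hedged sketch below computes attention at one fixed granularity, per spectrogram frame, and pools frames into an utterance vector; the feature sizes are illustrative, and this is the generic pattern the quote critiques rather than any specific cited model.

```python
# Hedged sketch of fixed-granularity attention over a spectrogram:
# one scalar weight per frame, then attention-weighted pooling.
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.score = nn.Linear(n_mels, 1)  # one score per frame

    def forward(self, spec):  # spec: (batch, frames, n_mels)
        alpha = torch.softmax(self.score(spec), dim=1)  # (batch, frames, 1)
        return (alpha * spec).sum(dim=1)  # utterance vector: (batch, n_mels)

pooled = FrameAttentionPool()(torch.randn(4, 300, 80))  # -> (4, 80)
```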
“…For example, a psychologist can design a treatment plan according to the emotions hidden or expressed in a patient's speech. Deep learning has accelerated progress in recognizing human emotions from speech [4][5][6][7][8][9], but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy.…”
Section: Introduction
confidence: 99%