Interspeech 2019
DOI: 10.21437/interspeech.2019-3247
Learning Alignment for Multimodal Emotion Recognition from Speech

Abstract: Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, while emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multimodalit…

Cited by 101 publications (59 citation statements)
References 20 publications
“…Recently in [9,10,20], a long short-term memory (LSTM) based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12]. However, these approaches are designed only to consider the interaction between the acoustic and textual information.…”
Section: Recent Work (mentioning)
confidence: 99%
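The inter-attention fusion described in this statement can be illustrated with a short sketch. The PyTorch snippet below is a minimal rendering under assumed choices (one LSTM per modality, dot-product scoring, mean pooling, arbitrary layer sizes); it is not the exact model of [9-12]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAttentionFusion(nn.Module):
    """Sketch: encode each modality with an LSTM, let text attend over audio, fuse."""
    def __init__(self, audio_dim, text_dim, hidden_dim):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)

    def forward(self, audio_feats, text_embeds):
        # audio_feats: (B, T_a, audio_dim), text_embeds: (B, T_t, text_dim)
        a, _ = self.audio_lstm(audio_feats)            # (B, T_a, H)
        t, _ = self.text_lstm(text_embeds)             # (B, T_t, H)
        # Inter-attention: every text step scores all audio steps.
        scores = torch.bmm(t, a.transpose(1, 2))       # (B, T_t, T_a)
        attn = F.softmax(scores, dim=-1)
        audio_context = torch.bmm(attn, a)              # (B, T_t, H)
        # Fuse text states with their attended audio context, pool to one vector.
        fused = torch.cat([t, audio_context], dim=-1)   # (B, T_t, 2H)
        return fused.mean(dim=1)                        # (B, 2H) utterance vector
```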
“…In this work, we decide to use the same model as in [22], where we align both audio and textual pre-trained representations through an attention mechanism on top of a bidirectional recurrent neural network. The only difference is the replacement of hand-engineered features by wav2vec embeddings and of textual GloVe embeddings [12] by BERT embeddings.…”
Section: Bimodal Emotion Recognition (mentioning)
confidence: 99%
“…Last, we experiment with combining pre-trained embeddings for both audio and text. We align wav2vec representations and sub-word embeddings from BERT in time through an attention-based recurrent neural network, similar to [22]. The resulting model is much larger than previous ones, and to avoid over-fitting we only train it on the full dataset.…”
Section: Bi-modal Transfer Learning (mentioning)
confidence: 99%
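Taken together, the two statements above describe swapping hand-engineered acoustic features for wav2vec embeddings and GloVe word vectors for BERT sub-word embeddings before alignment. A hedged sketch of that substitution, using Hugging Face's Wav2Vec2 and BERT as stand-ins (the model names and the one-second dummy waveform are assumptions; the cited work used the original wav2vec):

```python
import torch
from transformers import (Wav2Vec2Model, Wav2Vec2FeatureExtractor,
                          BertModel, BertTokenizer)

# Pre-trained encoders used as stand-ins for the embeddings mentioned above.
audio_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
text_model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

waveform = torch.randn(16000)  # placeholder: one second of 16 kHz audio
audio_inputs = audio_fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
text_inputs = tokenizer("i am so happy today", return_tensors="pt")

with torch.no_grad():
    wav2vec_frames = audio_model(**audio_inputs).last_hidden_state  # (1, T_a, 768)
    bert_subwords = text_model(**text_inputs).last_hidden_state     # (1, T_t, 768)

# These frame-level and sub-word sequences would then be aligned in time with an
# attention step on top of a bidirectional recurrent network, as in the earlier sketch.
```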
“…The restrictions found in the earlier techniques are reduced in subsequent works. Xu, H., et al. [27] in 2019 proposed an attention mechanism with the ASR system to learn the alignment between the original speech and the recognized text, which is then used to fuse features from the two modalities. The results show that the proposed method outperforms other approaches in emotion recognition.…”
Section: Related Work (mentioning)
confidence: 99%
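The mechanism attributed to Xu et al. [27], reusing an alignment between the original speech and the ASR-recognized words to fuse the two modalities, can be sketched as a per-word attentive pooling of acoustic frames. The shapes and the attention matrix below are illustrative placeholders, not the authors' actual ASR outputs:

```python
import torch

def fuse_with_asr_alignment(frame_feats, word_embeds, word_frame_attention):
    """
    frame_feats:          (T_a, D_a) acoustic features for the utterance
    word_embeds:          (T_w, D_t) embeddings of the ASR-recognized words
    word_frame_attention: (T_w, T_a) per-word attention over frames (rows sum to 1)
    returns:              (T_w, D_a + D_t) word-aligned fused features
    """
    aligned_audio = word_frame_attention @ frame_feats      # pool frames per word
    return torch.cat([aligned_audio, word_embeds], dim=-1)  # fuse both modalities

# Toy example with random placeholders: 200 frames, 12 recognized words.
fused = fuse_with_asr_alignment(
    torch.randn(200, 40),
    torch.randn(12, 300),
    torch.softmax(torch.randn(12, 200), dim=-1),
)
print(fused.shape)  # torch.Size([12, 340])
```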