Interspeech 2019
DOI: 10.21437/interspeech.2019-3247
Learning Alignment for Multimodal Emotion Recognition from Speech

Abstract: Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, while emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multimodalit…

Cited by 101 publications (59 citation statements)
References 20 publications
“…Recently in [9,10,20], a long short-term memory (LSTM) based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12]. However, these approaches are designed only to consider the interaction between the acoustic and textual information.…”
Section: Recent Work (mentioning)
confidence: 99%
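The inter-attention fusion described in this statement can be illustrated with a short sketch. The PyTorch snippet below is a minimal rendering under assumed choices (one LSTM per modality, dot-product scoring, mean pooling, arbitrary layer sizes); it is not the exact model of [9-12]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAttentionFusion(nn.Module):
    """Sketch: encode each modality with an LSTM, let text attend over audio, fuse."""
    def __init__(self, audio_dim, text_dim, hidden_dim):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)

    def forward(self, audio_feats, text_embeds):
        # audio_feats: (B, T_a, audio_dim), text_embeds: (B, T_t, text_dim)
        a, _ = self.audio_lstm(audio_feats)            # (B, T_a, H)
        t, _ = self.text_lstm(text_embeds)             # (B, T_t, H)
        # Inter-attention: every text step scores all audio steps.
        scores = torch.bmm(t, a.transpose(1, 2))       # (B, T_t, T_a)
        attn = F.softmax(scores, dim=-1)
        audio_context = torch.bmm(attn, a)              # (B, T_t, H)
        # Fuse text states with their attended audio context, pool to one vector.
        fused = torch.cat([t, audio_context], dim=-1)   # (B, T_t, 2H)
        return fused.mean(dim=1)                        # (B, 2H) utterance vector
```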
“…In this work, we decide to use the same model as in [22], where we align both audio and textual pre-trained representations through an attention mechanism on top of a bidirectional recurrent neural network. The only difference is the replacement of hand-engineered features by wav2vec embeddings and of textual GloVe embeddings [12] by BERT embeddings.…”
Section: Bimodal Emotion Recognition (mentioning)
confidence: 99%
“…Last, we experiment with combining pre-trained embeddings for both audio and text. We align wav2vec representations and sub-word embeddings from BERT in time through an attention-based recurrent neural network, similar to [22]. The resulting model is much larger than previous ones, and to avoid over-fitting we only train it on the full dataset.…”
Section: Bi-modal Transfer Learning (mentioning)
confidence: 99%
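Taken together, the two statements above describe swapping hand-engineered acoustic features for wav2vec embeddings and GloVe word vectors for BERT sub-word embeddings before alignment. A hedged sketch of that substitution, using Hugging Face's Wav2Vec2 and BERT as stand-ins (the model names and the one-second dummy waveform are assumptions; the cited work used the original wav2vec):

```python
import torch
from transformers import (Wav2Vec2Model, Wav2Vec2FeatureExtractor,
                          BertModel, BertTokenizer)

# Pre-trained encoders used as stand-ins for the embeddings mentioned above.
audio_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
text_model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

waveform = torch.randn(16000)  # placeholder: one second of 16 kHz audio
audio_inputs = audio_fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
text_inputs = tokenizer("i am so happy today", return_tensors="pt")

with torch.no_grad():
    wav2vec_frames = audio_model(**audio_inputs).last_hidden_state  # (1, T_a, 768)
    bert_subwords = text_model(**text_inputs).last_hidden_state     # (1, T_t, 768)

# These frame-level and sub-word sequences would then be aligned in time with an
# attention step on top of a bidirectional recurrent network, as in the earlier sketch.
```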
“…The restrictions found in the earlier techniques are reduced in subsequent works. Xu, H., et al. [27] in 2019 proposed an attention mechanism with the ASR system to learn the alignment between the original speech and the recognized text, which is then used to fuse features from the two modalities. The results show that the proposed method outperforms other approaches in emotion recognition.…”
Section: Related Work (mentioning)
confidence: 99%
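The mechanism attributed to Xu et al. [27], reusing an alignment between the original speech and the ASR-recognized words to fuse the two modalities, can be sketched as a per-word attentive pooling of acoustic frames. The shapes and the attention matrix below are illustrative placeholders, not the authors' actual ASR outputs:

```python
import torch

def fuse_with_asr_alignment(frame_feats, word_embeds, word_frame_attention):
    """
    frame_feats:          (T_a, D_a) acoustic features for the utterance
    word_embeds:          (T_w, D_t) embeddings of the ASR-recognized words
    word_frame_attention: (T_w, T_a) per-word attention over frames (rows sum to 1)
    returns:              (T_w, D_a + D_t) word-aligned fused features
    """
    aligned_audio = word_frame_attention @ frame_feats      # pool frames per word
    return torch.cat([aligned_audio, word_embeds], dim=-1)  # fuse both modalities

# Toy example with random placeholders: 200 frames, 12 recognized words.
fused = fuse_with_asr_alignment(
    torch.randn(200, 40),
    torch.randn(12, 300),
    torch.softmax(torch.randn(12, 200), dim=-1),
)
print(fused.shape)  # torch.Size([12, 340])
```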