2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639583

Multimodal Speech Emotion Recognition Using Audio and Text

Abstract: Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and th…
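The abstract describes a dual recurrent encoder: one RNN per modality, with the two representations combined for classification. Below is a minimal PyTorch sketch of that idea, assuming MFCC-like audio frames and tokenized transcripts; all layer sizes and the choice of GRU cells are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DualRecurrentEncoder(nn.Module):
    """Sketch of a dual recurrent encoder: one RNN per modality,
    final hidden states concatenated and fed to a classifier.
    Dimensions here are hypothetical, not the paper's settings."""
    def __init__(self, audio_dim=40, vocab_size=10000, embed_dim=128,
                 hidden_dim=128, num_classes=4):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio_feats, token_ids):
        # audio_feats: (batch, frames, audio_dim), e.g. MFCC frames
        # token_ids:   (batch, tokens) transcript token indices
        _, h_audio = self.audio_rnn(audio_feats)          # (1, batch, hidden)
        _, h_text = self.text_rnn(self.embed(token_ids))  # (1, batch, hidden)
        fused = torch.cat([h_audio[-1], h_text[-1]], dim=-1)
        return self.classifier(fused)                      # emotion logits
```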


Cited by 223 publications (143 citation statements) · References 27 publications
“…In [19], emotional keywords are exploited to identify the emotion classes effectively. Recently, in [9,10,20], long short-term memory (LSTM) based networks have been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using an inter-attention mechanism [11,12].…”
Section: Recent Work (mentioning)
confidence: 99%
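The inter-attention fusion mentioned in [11,12] lets one modality attend over the other's hidden states. The following is a generic sketch of that mechanism using cross-modal multi-head attention, with text states as queries over audio states; the head count, pooling, and dimensions are assumptions, not the cited papers' exact designs.

```python
import torch
import torch.nn as nn

class InterAttentionFusion(nn.Module):
    """Illustrative cross-modal (inter-)attention: text states attend
    over audio states; pooled contexts are concatenated."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, text_states, audio_states):
        # text_states:  (batch, T_text, hidden)
        # audio_states: (batch, T_audio, hidden)
        context, _ = self.attn(query=text_states, key=audio_states,
                               value=audio_states)
        # Mean-pool the attended audio context and the text states.
        fused = torch.cat([context.mean(dim=1),
                           text_states.mean(dim=1)], dim=-1)
        return fused  # (batch, 2 * hidden)
```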
“…This recurrent encoder will be applied in the same manner to the audio, text, and video modalities, independently. One slight difference exists in the audio case: we append an additional prosodic feature vector to the output of the audio recurrent encoder, following previous research [9]. For the video data, we obtain a fixed-dimensional representation of each frame from a pretrained ResNet.…”
Section: Recurrent Encoder (mentioning)
confidence: 99%
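The audio-side variation above, concatenating an utterance-level prosodic vector with the RNN's final state, could look like the sketch below. The prosodic dimension and the GRU choice are illustrative assumptions; the citing paper's exact feature set is not shown here.

```python
import torch
import torch.nn as nn

class AudioEncoderWithProsody(nn.Module):
    """Sketch: the audio RNN's final state is concatenated with an
    utterance-level prosodic vector (e.g. pitch/energy statistics).
    Feature dimensions are assumed for illustration."""
    def __init__(self, audio_dim=40, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)

    def forward(self, audio_feats, prosody_vec):
        # audio_feats: (batch, frames, audio_dim)
        # prosody_vec: (batch, prosody_dim) per-utterance prosodic stats
        _, h = self.rnn(audio_feats)
        return torch.cat([h[-1], prosody_vec], dim=-1)
```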
“…In emotion recognition, many studies extract features from the audio, visual, or textual domains and then fuse them at either the feature level or the decision level [18,19,20]. To leverage information from speech signals and text sequences, a previous study [21] used neural networks to model the two sequences separately and directly concatenated the two modalities for emotion classification. In [22], a tensor fusion network was proposed to fuse features from different modalities and learn intra-modality and inter-modality dynamics.…”
Section: Related Work (mentioning)
confidence: 99%
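The tensor-fusion idea in [22] can be summarized in a few lines: append a constant 1 to each modality vector, then take their outer product, so the result carries the unimodal features (via the 1 entries) as well as all cross-modal products. A minimal two-modality sketch, with shapes assumed for illustration:

```python
import torch

def tensor_fusion(audio_vec, text_vec):
    """Sketch of two-modality tensor fusion: outer product of the
    1-augmented modality vectors, flattened into a fusion vector."""
    batch = audio_vec.size(0)
    ones = torch.ones(batch, 1)
    a = torch.cat([audio_vec, ones], dim=1)   # (batch, d_a + 1)
    t = torch.cat([text_vec, ones], dim=1)    # (batch, d_t + 1)
    # Per-example outer product holds unimodal and bimodal terms.
    fused = torch.bmm(a.unsqueeze(2), t.unsqueeze(1))  # (batch, d_a+1, d_t+1)
    return fused.flatten(start_dim=1)
```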