Deep Multimodal Learning for Emotion Recognition in Spoken Language

Gu, Yue; Chen, Shuhong; Marsic, Ivan

doi:10.1109/icassp.2018.8462440

Cited by 37 publications

(42 citation statements)

References 15 publications


“…Compared with [20,31] based on the IMOCAP database, the model presented in this paper performs better, as shown in Table 2. For the modal fusing, the feature-level and decision-level fusion method are all useful.…”

Section: Results (mentioning)

confidence: 88%

“…In order to test the performance of multimodal emotion recognition model proposed in this paper, we compared it with other different models on the IEMOCAP database. Soujanya Gu et al [20] applied CNN-LSTM to process the speech date and CNNs for the textual features learning; finally, they integrated all features and trained them with a three-layer deep neural network. They adopted the feature fusion method which we also referenced.…”

Section: Results (mentioning)

confidence: 99%

“…For the research on multimodal emotional analysis, [17][18][19][20] all applied CNNs were as a trainable feature extractor to extract textual, visual, or audio features. Poria et al [17] fed all the emotional features into the MKL (multiple kernel learning) classifier.…”

Section: Related Work (mentioning)

confidence: 99%

“…In the model of [19], the final CNN layer computed the weighted sum of all the information extracted from the attention input. After extracting features from CNN networks, Gu et al [20] made use of a three-layer deep neural network to fuse the multimodal features.…”

Section: Related Work (mentioning)

confidence: 99%


Audio‐Textual Emotion Recognition Based on Improved Neural Networks

Cai, Dong, et al. 2019

Mathematical Problems in Engineering


With the rapid development in social media, single-modal emotion recognition is hard to satisfy the demands of the current emotional recognition system. Aiming to optimize the performance of the emotional recognition system, a multimodal emotion recognition model from speech and text was proposed in this paper. Considering the complementarity between different modes, CNN (convolutional neural network) and LSTM (long short-term memory) were combined in a form of binary channels to learn acoustic emotion features; meanwhile, an effective Bi-LSTM (bidirectional long short-term memory) network was resorted to capture the textual features. Furthermore, we applied a deep neural network to learn and classify the fusion features. The final emotional state was determined by the output of both speech and text emotion analysis. Finally, the multimodal fusion experiments were carried out to validate the proposed model on the IEMOCAP database. In comparison with the single modal, the overall recognition accuracy of text increased 6.70%, and that of speech emotion recognition soared 13.85%. Experimental results show that the recognition accuracy of our multimodal is higher than that of the single modal and outperforms other published multimodal models on the test datasets.


“…The encoder-decoder model was recently introduced in natural language processing and computer vision to model sequential data such as phrases [10,11,29,30] and videos [13]. It has shown great performance on a number of tasks including machine translation [6], question answering [25] and video description [13].…”

Section: Related Work (mentioning)

confidence: 99%

Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

Duan, Lou, Srihari, et al. 2019

Lecture Notes in Computer Science


Current state-of-the-art nonparametric Bayesian text clustering methods model documents through multinomial distribution on bags of words. Although these methods can effectively utilize the word burstiness representation of documents and achieve decent performance, they do not explore the sequential information of text and relationships among synonyms. In this paper, the documents are modeled as the joint of bags of words, sequential features and word embeddings. We proposed Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation in text clustering. The sequential features are extracted by the encoder-decoder component. Word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: 1) improved performance across multiple diverse text datasets in terms of the normalized mutual information (NMI); 2) more accurate inference of ground truth cluster numbers with regularization effect on tiny outlier clusters.


Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning

Montes, Gómez 2020

Business Information Systems
