2021
DOI: 10.3389/fnbot.2021.697634
Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Abstract: The redundant information and noise generated during single-modal feature extraction make it difficult for traditional learning algorithms to achieve ideal recognition performance. A multi-modal fusion emotion recognition method for speech expressions based on deep learning is proposed. Firstly, corresponding feature extraction methods are set up for each single modality. Among them, the voice uses a convolutional neural network–long short-term memory (CNN-LSTM) network, and the facia…
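The abstract names a CNN-LSTM pipeline for speech: convolutional layers extract local patterns from spectrogram frames, and an LSTM summarizes them into one utterance-level vector. The paper's actual architecture is not given in this excerpt; the following is a minimal NumPy sketch of that general shape (random weights, toy dimensions, all layer sizes assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels):
    """Valid 1-D convolution over time (one output channel per kernel), then ReLU."""
    T, F = x.shape                       # frames x mel bins
    K, width, _ = kernels.shape
    out = np.empty((T - width + 1, K))
    for k in range(K):
        for t in range(T - width + 1):
            out[t, k] = np.sum(x[t:t + width] * kernels[k])
    return np.maximum(out, 0.0)

def lstm_last_hidden(seq, Wx, Wh, b, H):
    """Run one LSTM layer over the sequence; return the final hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in seq:
        z = Wx @ x_t + Wh @ h + b        # stacked gate pre-activations: i, f, o, g
        i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

# Toy "spectrogram": 20 frames x 8 mel bins.
spec = rng.standard_normal((20, 8))
kernels = rng.standard_normal((4, 3, 8)) * 0.1   # 4 conv kernels of width 3
feat_maps = conv1d_relu(spec, kernels)           # (18, 4): local acoustic features
H = 6
Wx = rng.standard_normal((4 * H, 4)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
utterance_vec = lstm_last_hidden(feat_maps, Wx, Wh, b, H)
print(utterance_vec.shape)  # (6,): fixed-length speech embedding
```

The key property the CNN-LSTM combination buys is that a variable-length utterance is reduced to a fixed-length embedding that a downstream fusion classifier can consume.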

Cited by 25 publications (8 citation statements) · References 265 publications
“…Therefore, our audio-visual model works like the human brain, analyzing both acoustic and visual information simultaneously. This strategy is known as model-level fusion [ 112 ].…”
Section: Methods
confidence: 99%
“…When the features of one mode are few, the existing information in the other mode can help emotional decision-making [36]. In this regard, the LSTM structure is applied for obtaining the dependence between different modes [37], the structure of which is displayed in Fig. 4.…”
Section: Selection of Audio and Video Features for Feature Fusion (A/V)
confidence: 99%
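The two statements above describe model-level fusion: rather than deciding per modality and merging verdicts, the per-frame audio and facial feature streams are combined and a single LSTM models their joint temporal dependence, so a weak modality can lean on the stronger one. A minimal NumPy sketch of that idea (toy dimensions and random weights; concatenation is one common fusion choice assumed here, not necessarily the cited papers' exact design):

```python
import numpy as np

rng = np.random.default_rng(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h, c, Wx, Wh, b, H):
    """One LSTM update on a fused audio-visual frame."""
    z = Wx @ x_t + Wh @ h + b
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

T, Da, Dv, H = 12, 5, 7, 8
audio = rng.standard_normal((T, Da))   # per-frame acoustic features
video = rng.standard_normal((T, Dv))   # per-frame facial features

# Model-level fusion: concatenate the modality streams frame by frame,
# then let one LSTM capture dependencies across (and between) modalities.
fused = np.concatenate([audio, video], axis=1)   # (T, Da + Dv)
Wx = rng.standard_normal((4 * H, Da + Dv)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in fused:
    h, c = lstm_step(x_t, h, c, Wx, Wh, b, H)
print(h.shape)  # (8,): joint audio-visual representation for the classifier
```

Because both modalities sit in the same recurrent state, frames where one stream is uninformative still contribute through the other, which is the complementarity the quoted statement appeals to.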
“…The accuracy of emotion recognition increased by 3% by considering the multimodal model and adding the chi-square test. As a result, using the chi-square test to eliminate redundancy and noise from the information of several features is significant [37].…”
Section: Chi-square Test Performance
confidence: 99%
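The chi-square test scores each (nonnegative) feature by how strongly its distribution deviates across emotion classes; low-scoring features are treated as redundant or noisy and dropped. A small self-contained NumPy sketch of this selection step (toy data; `chi2_scores` is an illustrative helper computing the same statistic commonly used for chi-square feature selection, not code from the paper):

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each nonnegative feature against the class labels."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # one-hot labels (n, C)
    observed = Y.T @ X                                   # per-class feature sums (C, F)
    class_prob = Y.mean(axis=0)                          # class frequencies (C,)
    feature_tot = X.sum(axis=0)                          # total mass per feature (F,)
    expected = np.outer(class_prob, feature_tot)         # expected sums if independent
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 tracks the label, feature 1 is constant noise.
X = np.array([[5., 1.],
              [6., 1.],
              [0., 1.],
              [1., 1.]])
y = np.array([1, 1, 0, 0])
scores = chi2_scores(X, y)
keep = np.argsort(scores)[::-1][:1]   # keep the top-1 feature
print(keep)  # [0] -> the informative feature survives, the noise feature is dropped
```

Pruning features this way before fusion shrinks the input the classifier must model, which is one plausible mechanism behind the accuracy gain the statement reports.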
“…The redundant data and noise produced during single-modal feature extraction make it hard for conventional learning algorithms to achieve ideal recognition performance. The authors in (5) propose a deep-learning-based multimodal fusion emotion recognition strategy for voice expressions. In a person's social and daily activities, voice, text, and facial expressions are the primary channels for conveying human feelings.…”
Section: Introduction
confidence: 99%