2021
DOI: 10.3389/fnbot.2021.697634
Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Abstract: The redundant information and noise generated during single-modal feature extraction make it difficult for traditional learning algorithms to achieve ideal recognition performance. A multi-modal fusion emotion recognition method for speech expressions based on deep learning is proposed. Firstly, corresponding feature extraction methods are set up for each single modality. Among them, the voice uses a convolutional neural network–long short-term memory (CNN-LSTM) network, and the facia…
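The abstract names a CNN-LSTM pipeline for speech: convolutional layers extract local patterns from spectrogram frames, and an LSTM summarizes them into one utterance-level vector. The paper's actual architecture is not given in this excerpt; the following is a minimal NumPy sketch of that general shape (random weights, toy dimensions, all layer sizes assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels):
    """Valid 1-D convolution over time (one output channel per kernel), then ReLU."""
    T, F = x.shape                       # frames x mel bins
    K, width, _ = kernels.shape
    out = np.empty((T - width + 1, K))
    for k in range(K):
        for t in range(T - width + 1):
            out[t, k] = np.sum(x[t:t + width] * kernels[k])
    return np.maximum(out, 0.0)

def lstm_last_hidden(seq, Wx, Wh, b, H):
    """Run one LSTM layer over the sequence; return the final hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in seq:
        z = Wx @ x_t + Wh @ h + b        # stacked gate pre-activations: i, f, o, g
        i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

# Toy "spectrogram": 20 frames x 8 mel bins.
spec = rng.standard_normal((20, 8))
kernels = rng.standard_normal((4, 3, 8)) * 0.1   # 4 conv kernels of width 3
feat_maps = conv1d_relu(spec, kernels)           # (18, 4): local acoustic features
H = 6
Wx = rng.standard_normal((4 * H, 4)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
utterance_vec = lstm_last_hidden(feat_maps, Wx, Wh, b, H)
print(utterance_vec.shape)  # (6,): fixed-length speech embedding
```

The key property the CNN-LSTM combination buys is that a variable-length utterance is reduced to a fixed-length embedding that a downstream fusion classifier can consume.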

Cited by 25 publications (8 citation statements) · References 265 publications
“…Therefore, our audio-visual model works like the human brain, analyzing both acoustic and visual information simultaneously. This strategy is known as model-level fusion [ 112 ].…”
Section: Methods
confidence: 99%
“…When the features of one mode are few, the existing information in the other mode can help emotional decision-making [36]. In this regard, the LSTM structure is applied for obtaining the dependence between different modes [37], the structure of which is displayed in Fig. 4.…”
Section: Selection of Audio and Video Features for Feature Fusion (A/V)
confidence: 99%
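The two statements above describe model-level fusion: rather than deciding per modality and merging verdicts, the per-frame audio and facial feature streams are combined and a single LSTM models their joint temporal dependence, so a weak modality can lean on the stronger one. A minimal NumPy sketch of that idea (toy dimensions and random weights; concatenation is one common fusion choice assumed here, not necessarily the cited papers' exact design):

```python
import numpy as np

rng = np.random.default_rng(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h, c, Wx, Wh, b, H):
    """One LSTM update on a fused audio-visual frame."""
    z = Wx @ x_t + Wh @ h + b
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

T, Da, Dv, H = 12, 5, 7, 8
audio = rng.standard_normal((T, Da))   # per-frame acoustic features
video = rng.standard_normal((T, Dv))   # per-frame facial features

# Model-level fusion: concatenate the modality streams frame by frame,
# then let one LSTM capture dependencies across (and between) modalities.
fused = np.concatenate([audio, video], axis=1)   # (T, Da + Dv)
Wx = rng.standard_normal((4 * H, Da + Dv)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in fused:
    h, c = lstm_step(x_t, h, c, Wx, Wh, b, H)
print(h.shape)  # (8,): joint audio-visual representation for the classifier
```

Because both modalities sit in the same recurrent state, frames where one stream is uninformative still contribute through the other, which is the complementarity the quoted statement appeals to.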
“…The accuracy of emotion recognition increased by 3% by considering the multimodal model and adding the chi-square test. As a result, using the chi-square test to eliminate redundancy and noise from the information of several features is significant [37].…”
Section: Chi-square Test Performance
confidence: 99%
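The chi-square test scores each (nonnegative) feature by how strongly its distribution deviates across emotion classes; low-scoring features are treated as redundant or noisy and dropped. A small self-contained NumPy sketch of this selection step (toy data; `chi2_scores` is an illustrative helper computing the same statistic commonly used for chi-square feature selection, not code from the paper):

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each nonnegative feature against the class labels."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # one-hot labels (n, C)
    observed = Y.T @ X                                   # per-class feature sums (C, F)
    class_prob = Y.mean(axis=0)                          # class frequencies (C,)
    feature_tot = X.sum(axis=0)                          # total mass per feature (F,)
    expected = np.outer(class_prob, feature_tot)         # expected sums if independent
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 tracks the label, feature 1 is constant noise.
X = np.array([[5., 1.],
              [6., 1.],
              [0., 1.],
              [1., 1.]])
y = np.array([1, 1, 0, 0])
scores = chi2_scores(X, y)
keep = np.argsort(scores)[::-1][:1]   # keep the top-1 feature
print(keep)  # [0] -> the informative feature survives, the noise feature is dropped
```

Pruning features this way before fusion shrinks the input the classifier must model, which is one plausible mechanism behind the accuracy gain the statement reports.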
“…The redundant data and noise produced during single-modal feature extraction make it hard for conventional learning algorithms to achieve ideal recognition performance. The authors in (5) propose a deep-learning-based multimodal fusion emotion recognition strategy for voice expressions. In a person's social and daily activities, voice, text, and facial expressions are the primary channels for conveying human feelings.…”
Section: Introduction
confidence: 99%