2020
DOI: 10.3390/s20226688
Fusion-ConvBERT: Parallel Convolution and BERT Fusion for Speech Emotion Recognition

Abstract: Speech emotion recognition predicts the emotional state of a speaker based on the person's speech. It brings an additional element for creating more natural human–computer interactions. Earlier studies on emotional recognition have been primarily based on handcrafted features and manual labels. With the advent of deep learning, there have been some efforts in applying the deep-network-based approach to the problem of emotion recognition. As deep learning automatically extracts salient features correlated to sp…

Cited by 27 publications (16 citation statements) · References 51 publications
“…A recurrent neural network (RNN) can be used to infer associations from 3D spectrogram data across different timesteps and frequencies [42]. “Fusion-ConvBERT” is a parallel fusion model proposed by Lee et al. [43] that combines bidirectional encoder representations from transformers (BERT) and convolutional neural networks. Zhang et al. [44] built a deep convolutional neural network (DCNN) and a bidirectional long short-term memory with attention (BLSTMwA) model (DCNN-BLSTMwA) that can be utilized as a pretrained model for subsequent emotion recognition tasks.…”
Section: Literature Review
confidence: 99%
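The parallel fusion the statement above describes can be sketched minimally: two feature extractors run on the same spectrogram and their embeddings are concatenated before classification. The toy extractors, random projections, and all shapes below are illustrative assumptions, not the actual Fusion-ConvBERT architecture from [43].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrogram: 40 mel bands x 100 frames (assumed sizes).
spectrogram = rng.standard_normal((40, 100))

def cnn_branch(spec, out_dim=64):
    # Stand-in for a CNN encoder: random projection of the
    # time-averaged spectrum to a fixed-size embedding.
    w = rng.standard_normal((spec.shape[0], out_dim))
    return spec.mean(axis=1) @ w

def bert_branch(spec, out_dim=64):
    # Stand-in for a transformer (BERT-style) encoder: pool over
    # frames, then project to an embedding of the same size.
    pooled = spec.max(axis=1)
    w = rng.standard_normal((spec.shape[0], out_dim))
    return pooled @ w

# Parallel fusion: run both branches on the same input,
# then concatenate their embeddings for a downstream classifier.
fused = np.concatenate([cnn_branch(spectrogram), bert_branch(spectrogram)])
print(fused.shape)  # (128,)
```

The key point of the parallel (rather than sequential) design is that both branches see the raw spectrogram directly, so neither representation is bottlenecked through the other.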
“…They suggested a genetic learning-based collaborative decision-making model, which was compared to concatenated equal-weighted choice fusion, BPN learning-based weighted decision fusion, and feature fusion methods. Audio spectrogram features are obtained from BERT and CNN and combined in parallel to form a multimodal representation [32].…”
Section: Related Studies and Motivations
confidence: 99%
“…Tsai et al. [41] proposed learning interactions between the modalities by designing an attention-based cross-modal architecture using multimodal transformers. Also, recently, transfer learning techniques that use pre-trained networks to extract features [26], [62]–[64] have advanced significantly. BERT, a Transformer-based model, has shown performance improvement by fine-tuning from pre-trained weights for a specific downstream task [65].…”
Section: Related Work
confidence: 99%
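Cross-modal attention of the kind described above lets one modality attend over another: queries come from one stream, keys and values from the other. The sketch below is a bare scaled-dot-product version with randomly generated stand-in embeddings; the dimensions and modality names are assumptions for illustration, not the multimodal-transformer architecture from [41].

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                 # shared embedding size (assumed)
audio = rng.standard_normal((8, d))    # 8 audio timesteps (assumed)
text = rng.standard_normal((5, d))     # 5 text tokens (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-modal attention: audio queries attend over text keys/values,
# so each audio step is enriched with relevant text context.
q, k, v = audio, text, text
scores = softmax(q @ k.T / np.sqrt(d))  # (8, 5) attention weights
attended = scores @ v                   # (8, d) text-informed audio
print(attended.shape)
```

In a full model the queries, keys, and values would each pass through learned linear projections; they are omitted here to keep the attention mechanism itself visible.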
“…To improve performance, there have been some efforts to fuse different features extracted from a unimodal source. Lee et al. [26] proposed a method that combines two features of CNN and BERT in parallel from an audio spectrogram. Xu et al. [27] proposed the “Hierarchical Grained and Feature Model (HGFM)” by fusing handcrafted features with features extracted by a Gated Recurrent Unit (GRU) network.…”
Section: Introduction
confidence: 99%
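The handcrafted-plus-learned fusion mentioned in the last statement can be sketched as concatenating simple per-feature statistics with the final state of a recurrent encoder. The minimal GRU cell below uses random, untrained weights purely as a stand-in for the learned extractor; all sizes are illustrative assumptions, not the HGFM configuration from [27].

```python
import numpy as np

rng = np.random.default_rng(2)

frames = rng.standard_normal((20, 12))  # 20 frames x 12 features (assumed)

# Handcrafted branch: simple per-feature statistics over time.
handcrafted = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])  # (24,)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(x, hidden=8):
    # Minimal single-layer GRU with random weights: a stand-in for
    # the trained recurrent feature extractor in an HGFM-style model.
    d = x.shape[1]
    wz, wr, wh = (rng.standard_normal((d + hidden, hidden)) * 0.1
                  for _ in range(3))
    h = np.zeros(hidden)
    for frame in x:
        xh = np.concatenate([frame, h])
        z = sigmoid(xh @ wz)                                   # update gate
        r = sigmoid(xh @ wr)                                   # reset gate
        h_tilde = np.tanh(np.concatenate([frame, r * h]) @ wh) # candidate
        h = (1 - z) * h + z * h_tilde
    return h

# Fusion: concatenate handcrafted statistics with the GRU summary.
fused = np.concatenate([handcrafted, gru_last_state(frames)])
print(fused.shape)  # (32,)
```

The appeal of this hybrid is that the handcrafted statistics stay interpretable and stable while the recurrent branch captures temporal dynamics the fixed statistics miss.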