Preprint, 2021
DOI: 10.20944/preprints202108.0433.v1

A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition

Abstract: Speech emotion recognition remains a challenging task in natural language processing, placing strict requirements on the effectiveness of both feature extraction and the acoustic model. With that in mind, a Heterogeneous Parallel Convolution Bi-LSTM model is proposed to address these challenges. It consists of two heterogeneous branches: the left one contains two dense layers and a Bi-LSTM layer, while the right one contains a dense layer, a convolution layer, and a Bi-LSTM layer. It can exploit the spatiotemporal …
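The two-branch architecture described in the abstract can be sketched in Keras. This is a minimal illustration, not the authors' implementation: layer widths, activations, the fusion step, and the classifier head are assumptions, since the truncated abstract does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed input/output sizes: 100 frames of 40 acoustic features, 7 emotion classes.
n_frames, n_feats, n_classes = 100, 40, 7

inp = layers.Input(shape=(n_frames, n_feats))

# Left branch: two dense layers followed by a Bi-LSTM layer.
left = layers.Dense(128, activation="relu")(inp)
left = layers.Dense(64, activation="relu")(left)
left = layers.Bidirectional(layers.LSTM(64))(left)

# Right branch: a dense layer, a 1-D convolution layer, then a Bi-LSTM layer.
right = layers.Dense(128, activation="relu")(inp)
right = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(right)
right = layers.Bidirectional(layers.LSTM(64))(right)

# Fuse the heterogeneous branches and classify.
merged = layers.Concatenate()([left, right])
out = layers.Dense(n_classes, activation="softmax")(merged)

model = Model(inp, out)
```

The parallel branches see the same input but summarize it differently: the dense-only branch learns global per-frame transformations, while the convolutional branch captures local patterns across neighboring frames before the Bi-LSTM models temporal context in both directions.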

Cited by 9 publications (4 citation statements)
References 25 publications
“…However, the proposed ESN model is able to outperform these deep learning models by achieving 84.80% UA. Authors in [13] adopted the Heterogeneous Parallel Convolution Bi-LSTM model in a speaker-independent setting on the SAVEE dataset and achieved 56.5% UA, and the Random Deep Belief Networks model [44] achieved 53.60% UA on SAVEE; however, our method obtained 65.95% UA. Unlike EMODB and SAVEE, one can notice the big challenge of gaining higher accuracy on the Aibo dataset.…”
Section: Comparison With the State-of-the-art
confidence: 84%
“…Typical approaches for supporting temporal data are LSTM and ESN. Authors in [13] used Bi-LSTM deep learning with two heterogeneous branches, where the left side has two dense layers and the right side has a convolution layer. Additionally, handcrafted time-series features with 512 frames are used in [14] to feed a CNN and Bi-LSTM model.…”
Section: Literature Review
confidence: 99%
“…Reference [25] adopted the spectrogram, fused it with a convolutional neural network (CNN) and an attention mechanism, used two different convolution kernels to extract time-domain and frequency-domain features respectively, and saved the spectrogram directly as an image. After normalization, the accuracy of speech emotion evaluation is high.…”
Section: Spectrogram
confidence: 99%
“…Mei Wang, et al. [6] fused electroencephalograms (EEGs) and facial expression information by using maximum-weight multimodal fusion at the decision level. However, a heterogeneity gap exists [7] between the different modalities, and some researchers have neglected the connections between them [8]. We need to narrow the heterogeneity gaps and make good use of the connections between the different modalities.…”
Section: I
confidence: 99%