Abstract: Recognizing human emotions by machines is a complex task. Deep learning models attempt to automate this process by enabling machines to exhibit learning capabilities. However, identifying human emotions from speech with good performance is still challenging. With the advent of deep learning algorithms, this problem has recently been addressed. However, most past research focused on feature extraction as the only method for training. In this research, we have explored two different methods of extra…
“…As far as the CREMA-D and TESS datasets are concerned, some authors have evaluated results on them by applying neural networks, achieving testing accuracies of 55.01% [25] for CREMA-D and 97.15% [26] for TESS, respectively, as reflected in Tab. 3, which shows that the results achieved by single-model techniques under masked or complex conditions are not good enough.…”
In recent years, research on facial expression recognition (FER) under masks has been trending. Wearing a mask for protection from COVID-19 has become a necessity, and it hides facial expressions, which is why FER under a mask is a difficult task. The prevailing unimodal techniques for facial recognition do not deliver good results for masked faces; however, a multimodal technique can be employed to generate better results. We propose a multimodal methodology based on deep learning for facial recognition under a masked face using facial and vocal expressions. The multimodal model has been trained on facial and vocal datasets. We have used standard datasets: M-LFW for masked faces, and the CREMA-D and TESS datasets for vocal expressions. The vocal expressions are in the form of audio while the face data is in image form, so the data is heterogeneous. In order to make the data homogeneous, the voice data is converted into images by computing spectrograms. A spectrogram embeds important features of the voice and converts the audio into image form. The dataset is then passed to the multimodal neural network for training, and the experimental results demonstrate that the proposed multimodal algorithm outperforms unimodal methods and other state-of-the-art deep neural network models.
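The abstract does not include the preprocessing code; the following is a minimal sketch of the audio-to-spectrogram conversion it describes, assuming librosa and matplotlib are available and that a log-mel spectrogram is an acceptable stand-in for the exact representation used. File names and parameter values are illustrative, not taken from the paper.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

def audio_to_spectrogram_image(wav_path, out_path, sr=16000, n_mels=128):
    # Load and resample the clip, compute a log-mel spectrogram, save it as an image.
    y, sr = librosa.load(wav_path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)          # log (dB) scale
    plt.imsave(out_path, S_db, origin='lower', cmap='magma')

# Hypothetical file names for a vocal-expression clip:
audio_to_spectrogram_image('vocal_clip.wav', 'vocal_clip.png')
```

Writing the spectrogram out as an image lets the vocal branch share the same convolutional input pipeline as the masked-face images, which is the homogenization step the abstract refers to.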
“…For example, reference [27] extracted high-level features from the original spectrogram, fused CNN and long short-term memory (LSTM) architectures, designed a neural network for speech emotion recognition, and used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset to verify its effectiveness. Reference [28] combined the spectrogram with a three-layer LSTM and assessed the model's robustness to noisy data by comparing performance with and without denoising. Reference [29] used the Gated Recurrent Unit (GRU) to recognize speech emotion and, with added noise, achieved results comparable to the LSTM while remaining applicable to embedded devices.…”
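As a rough illustration of the CNN plus LSTM fusion described in [27], the sketch below stacks convolutional layers over a log-mel spectrogram and feeds the resulting frame sequence to an LSTM. The Keras API choice, layer sizes, and the four-class output are assumptions for the sketch, not the cited architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(128, 128, 1), num_classes=4):
    inp = layers.Input(shape=input_shape)                 # (mel bins, frames, 1)
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2)(x)
    # Treat the time axis as a sequence of per-frame feature vectors.
    x = layers.Permute((2, 1, 3))(x)                      # (frames, mel bins, channels)
    x = layers.Reshape((x.shape[1], -1))(x)
    x = layers.LSTM(128)(x)
    out = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```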
Section: Recurrent Neural Network Model and Attention Mechanism
Natural language processing technologies have become more widely available in recent years, making them more useful in everyday situations. Machine learning systems that employ accessible datasets and corpora to serve the whole spectrum of problems addressed in computational linguistics have recently yielded a number of promising breakthroughs. These methods are particularly advantageous for regional languages, as they gain cutting-edge language processing tools as soon as the requisite corpus data is available. Most people today are unconcerned with the importance of reading. Reading aloud, however, is an effective technique for nourishing feelings as well as a necessary skill in the learning process. This paper proposes a novel approach for speech recognition based on neural networks. The attention mechanism is first utilized to produce the speech accuracy and fluency assessments, with the spectrum map as the feature-extraction input. A new type of deep speech model is employed to increase phoneme identification accuracy and, in turn, reading precision. The experiments use the exportchapter tool, which provides a corpus, together with the TensorFlow framework. The experimental findings reveal that the proposed model can assess spoken speech accuracy and reading fluency more effectively than the previous model, and its evaluation scores are more accurate.
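The paper's exact attention head is not reproduced here; the following is a minimal sketch of attention pooling over recurrent features of a spectrum-map sequence, assuming a Keras/TensorFlow setting and a single fluency/accuracy score as the output. All layer sizes and the bidirectional GRU choice are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_scorer(time_steps=None, feat_dim=128):
    inp = layers.Input(shape=(time_steps, feat_dim))      # frame-level spectral features
    h = layers.Bidirectional(layers.GRU(64, return_sequences=True))(inp)
    # Additive attention: one learned score per frame, softmax-normalized over time.
    scores = layers.Dense(1, activation='tanh')(h)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])
    out = layers.Dense(1, activation='sigmoid')(context)  # fluency/accuracy score
    return models.Model(inp, out)
```

The attention weights give a per-frame relevance profile, which is why such pooling is often preferred over taking only the final recurrent state when scoring a whole utterance.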
“…Features of speech play a vital part in distinguishing a speaker from others. Feature extraction reduces the magnitude of the speech signal without damaging the power of the speech signal [15,16,17]. In [18], the authors introduced a new approach that exploits the fine-tuning of the size and shift parameters of the spectral analysis window used to compute the initial short-time Fourier transform to improve the performance of a speaker-dependent automatic speech recognition (ASR) system.…”
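The cited approach [18] tunes the size and shift of the spectral analysis window; the sketch below only illustrates how those two parameters enter a short-time Fourier transform, assuming librosa. The 25 ms / 10 ms values are common defaults, not the values reported in [18].

```python
import numpy as np
import librosa

def stft_magnitude(wav_path, win_ms=25.0, hop_ms=10.0, sr=16000):
    # Analysis-window length and shift (hop) are the tunable parameters of interest.
    y, sr = librosa.load(wav_path, sr=sr)
    win_length = int(sr * win_ms / 1000)   # e.g. 400 samples at 16 kHz
    hop_length = int(sr * hop_ms / 1000)   # e.g. 160 samples at 16 kHz
    D = librosa.stft(y, n_fft=512, win_length=win_length, hop_length=hop_length)
    return np.abs(D)                       # magnitude spectrogram

# A small grid over (win_ms, hop_ms) pairs could then be scored against
# ASR accuracy to select the best setting for a given speaker.
```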
Speech recognition refers to the capability of software or hardware to receive a speech signal, identify the speaker's features in the speech signal, and recognize the speaker thereafter. In general, the speech recognition process involves three main steps: acoustic processing, feature extraction, and classification/recognition. The purpose of feature extraction is to represent a speech signal using a predetermined number of signal components, because all the information in the acoustic signal is excessively cumbersome to handle and some of it is irrelevant to the identification task. This study proposes a machine learning-based approach that extracts feature parameters from speech signals to improve the performance of speech recognition applications in real-time smart city environments. Moreover, the principle of mapping a block of main memory to the cache is exploited to reduce computing time; the cache block size is a parameter that strongly affects cache performance. Implementing such processes in real-time systems requires high computation speed, so processing speed plays an important role and demands modern technologies and fast algorithms that accelerate the extraction of feature parameters from speech signals. Problems with speeding up the digital processing of speech signals have yet to be completely resolved. The experimental results demonstrate that the proposed method successfully extracts the signal features and achieves strong classification performance compared with other conventional speech recognition algorithms.
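The abstract does not specify which feature parameters are extracted; the sketch below uses MFCCs as one common compact parameterization of an utterance and a generic scikit-learn classifier as a stand-in for the recognition stage. It does not model the cache-mapping optimization the abstract mentions, and all names are hypothetical.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_vector(wav_path, sr=16000, n_mfcc=13):
    # One fixed-size feature vector per utterance: per-coefficient mean and std.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical usage: 'wavs' and 'labels' are lists of file paths and speaker IDs.
# X = np.stack([mfcc_vector(p) for p in wavs])
# clf = SVC(kernel='rbf').fit(X, labels)
```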