Interspeech 2020
DOI: 10.21437/interspeech.2020-2408

A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition

Cited by 20 publications (12 citation statements)
References 19 publications

“…
Model                 Size   UA(%)   WA(%)   F1(%)
Han (2014) [2]        12.3   48.20   54.30   -
Li (2019) [3]         9.90   67.40   -       67.10
Zhong (2020) [4]      0.90   71.72   70.39   70.85
Ours (F-Loss, 7sec)   0.88   70.76   70.23   70.20
…”
Section: Methods (mentioning, confidence: 99%)
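The columns in these tables follow the usual SER conventions: WA (weighted accuracy) is plain overall accuracy, while UA (unweighted accuracy) is the mean of per-class recalls, so under-represented emotion classes count equally. A minimal sketch of both metrics (the function name and the toy labels are illustrative, not from the cited papers):

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA (weighted accuracy) is overall accuracy; UA (unweighted
    accuracy) is the mean of per-class recalls, so minority emotion
    classes weigh as much as majority ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = np.mean(recalls)
    return wa, ua

# Toy run: 4 emotion classes with imbalanced support
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 0, 1, 0, 2, 0]
print(wa_ua(y_true, y_pred))  # (0.75, 0.625)
```

On the toy labels above, the majority class dominates WA (0.75) while the missed minority classes pull UA down to 0.625, which is why SER papers typically report both.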
“…
Model              Size   UA(%)   WA(%)   F1(%)
Chen (2018) [5]    323    82.82   -       -
Zhao (2019) [8]    4.34   79.70   -       -
Zhong (2020) [4]   0…

We present simulation results to compare our model to several benchmarks on the IEMOCAP (scripted+improvised), IEMOCAP (improvised), and EMO-DB datasets in Tables 2, 3, and 4, respectively. As shown in Table 2, our model has slightly lower WA, UA, and F1 than the Zhong model [4] on the IEMOCAP (scripted+improvised) dataset, which can be attributed to that model's training on different annotations in addition to the label of each utterance. On the EMO-DB dataset, where such additional annotations are unavailable for training, our model outperforms the Zhong model [4] by more than 2.4% (Table 4).…”
Section: Methods (mentioning, confidence: 99%)
“…As one of the most important tasks of affective computing, speech emotion recognition (SER) aims to detect the emotional states of speakers and has a wide range of applications, such as health-care systems and human-machine interaction [1]. With the development of deep learning, many studies have employed convolutional neural network (CNN) and recurrent neural network (RNN) based models to generate more discriminative acoustic features and boost the performance of SER [2,3,4,5,6]. Most of these methods use static features as the network input to learn high-level features.…”
Section: Introduction (mentioning, confidence: 99%)
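For context on the technique named in the paper's title, here is a minimal sketch of a depthwise separable convolution block, the factorization such lightweight models are built on. This is a generic PyTorch illustration, not the indexed paper's actual architecture; the class name, channel counts, and log-mel input shape are assumptions for the example:

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: one spatial filter per channel
    (depthwise), then a 1x1 projection that mixes channels (pointwise).
    Generic sketch; not the architecture from the indexed paper."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch applies an independent k x k filter to each channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # 1x1 convolution recombines the per-channel outputs
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Hypothetical input: a batch of log-mel spectrograms (batch, 1, mels, frames)
x = torch.randn(4, 1, 64, 300)
block = nn.Sequential(SeparableConv2d(1, 32), nn.ReLU(),
                      SeparableConv2d(32, 64), nn.ReLU())
print(block(x).shape)  # torch.Size([4, 64, 64, 300])
```

The saving is easy to count: a standard 3x3 convolution from 32 to 64 channels needs 3·3·32·64 = 18,432 weights, while the separable version needs 3·3·32 + 32·64 = 2,336, roughly an 8x reduction; this factorization is what the "lightweight" in the title refers to.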