Interspeech 2015
DOI: 10.21437/interspeech.2015-336

High-level feature representation using recurrent neural network for speech emotion recognition

Cited by 271 publications (146 citation statements)
References 15 publications
“…The Angry and Neutral categories exhibit much lower performance than the Happy and Sad categories, which can be understood from the scatter plots presented in Figure 2, as the Angry class overlaps with the Happy and Neutral categories, and Neutral overlaps with all the other categories. Lee & Tashev (2015) report an unweighted and weighted accuracy of 52.13% and 57.91%, respectively, for the DNN-ELM model.…”
Section: Frame Based Representations With the Denoising Autoencoder
confidence: 94%
“…(1) A network with the same architecture without pre-training, (2) a softmax classifier trained on features extracted with the COVAREP toolbox, such as MFCCs and prosodic features (for example, pitch, peak slope, Normalized Amplitude Quotient (NAQ), and the difference between the first two harmonics in speech (H1-H2)), and (3) the DNN-ELM approach described in Han et al. (2014), for which results in the same experimental setting (emotion categories and speaker-independent data splits) as our work have been reported in Lee & Tashev (2015). The Angry and Neutral categories exhibit much lower performance than the Happy and Sad categories, which can be understood from the scatter plots presented in Figure 2, as the Angry class overlaps with the Happy and Neutral categories, and Neutral overlaps with all the other categories.…”
Section: Frame Based Representations With the Denoising Autoencoder
confidence: 99%
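
The baseline described in item (2) of the excerpt above is, in essence, a softmax classifier over utterance-level acoustic features. The sketch below illustrates that kind of setup under stated assumptions: librosa stands in for the COVAREP toolbox, the feature set is reduced to pooled MFCC statistics, and every name here is illustrative rather than taken from the cited work.

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # 4-class IEMOCAP-style setup (assumed)

def utterance_features(wav_path, sr=16000, n_mfcc=13):
    # Frame-level MFCCs, mean/std-pooled into one utterance-level vector.
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_softmax_baseline(wav_paths, labels):
    # Linear classifier with a softmax-style output over the four emotions.
    X = np.stack([utterance_features(p) for p in wav_paths])
    y = np.array([EMOTIONS.index(lab) for lab in labels])
    return LogisticRegression(max_iter=1000).fit(X, y)
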
“…Mao et al. [3] first introduced Convolutional Neural Networks (CNNs) for the SER task and obtained remarkable results on various datasets by learning affect-salient features. Recurrent Neural Networks (RNNs) have also been introduced for SER, with a deep Bidirectional Long Short-Term Memory (BLSTM) network proposed by Lee et al. [4]. Several papers have since combined CNNs with LSTM cells to improve speech emotion recognition, operating on log Mel filterbanks (logMel) [5] or on the raw signal in an end-to-end manner [6].…”
Section: Related Work
confidence: 99%
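
For a concrete picture of the frame-level recurrent approach the excerpt above refers to, here is a minimal sketch, assuming PyTorch and 40-dimensional log-Mel filterbank inputs: a bidirectional LSTM over the frames, mean-pooled to an utterance-level emotion posterior. It is an illustrative reconstruction, not the architecture of Lee & Tashev (2015); the layer sizes and class count are assumptions.

import torch
import torch.nn as nn

class BLSTMEmotionClassifier(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                             num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, logmel):              # logmel: (batch, frames, n_mels)
        frames, _ = self.blstm(logmel)      # (batch, frames, 2 * hidden)
        pooled = frames.mean(dim=1)         # mean-pool over time
        return self.out(pooled)             # utterance-level class logits

# Example: a batch of 8 utterances, each 300 frames of 40-dim log-Mel features.
logits = BLSTMEmotionClassifier()(torch.randn(8, 300, 40))   # -> shape (8, 4)
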
“…The total number of speakers in the corpus is 10. We only considered the samples belonging to the four emotional categories of happiness, sadness, neutral and anger, to keep the analysis consistent with previous works [6,7,8,9,10,11,15,16]. The number of utterances in each emotional class of each speaker is shown in Table 1.…”
Section: Corpus
confidence: 99%
“…Initial DNN-based models [4] were still based on the same utterance-level feature extraction. In subsequent approaches, however, speech features extracted from each frame were used as inputs to more complex neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and the accuracy was further improved [5,6,7]. Recent years have seen the application of methods developed in other AI fields, such as self-attention models [8], Connectionist Temporal Classification (CTC) [9] and Dilated Residual Networks (DRN) [10].…”
Section: Introduction
confidence: 99%
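
One of the newer directions listed in the excerpt above is self-attention over frame-level features. Below is a minimal sketch of attention-based pooling, assuming PyTorch; the single-score attention design, feature dimension, and class count are illustrative assumptions, not the exact models of the cited papers [8].

import torch
import torch.nn as nn

class AttentivePoolingSER(nn.Module):
    # Scores each frame, softmax-normalizes the scores over time,
    # and classifies the attention-weighted sum of the frames.
    def __init__(self, feat_dim=40, n_classes=4):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # one attention score per frame
        self.out = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):                                # frames: (batch, T, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)    # (batch, T, 1)
        utterance = (weights * frames).sum(dim=1)             # (batch, feat_dim)
        return self.out(utterance)                            # class logits

logits = AttentivePoolingSER()(torch.randn(8, 300, 40))       # -> shape (8, 4)
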