TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON)
DOI: 10.1109/tencon.2019.8929257
Emotion Recognition from Raw Speech using Wavenet

Cited by 17 publications (14 citation statements)
References 14 publications
“…The lexical features, however, labeled sentences without any emotional words as "neutral", which biased the decision towards the "neutral" class in some cases, compensating for misclassifications by the classifier based on acoustic features. Still, the performance for all four classes was rather similar, in contrast to some of the reported papers [17,18,27,33], for which the differences between the highest and lowest detection probabilities among the four classes range from 27.1% to 64%, resulting in high UARs.…”
Section: Discussion (supporting)
confidence: 54%
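For readers unfamiliar with the metric mentioned in this excerpt, UAR (unweighted average recall) is the mean of the per-class recalls, so every emotion class counts equally regardless of how many test samples it has. A minimal sketch of how it can be computed with scikit-learn; the four-class labels and predictions below are purely illustrative, not data from any of the cited papers:

```python
from sklearn.metrics import recall_score

# Illustrative ground-truth and predicted labels for a 4-class
# emotion task (angry / happy / neutral / sad); not real data.
y_true = ["angry", "happy", "neutral", "neutral", "sad", "sad"]
y_pred = ["angry", "neutral", "neutral", "neutral", "sad", "happy"]

# UAR = unweighted (macro) average of the per-class recalls.
uar = recall_score(y_true, y_pred, average="macro")

# Per-class recalls expose the spread between the best- and
# worst-recognized classes discussed in the excerpt above.
labels = ["angry", "happy", "neutral", "sad"]
per_class = recall_score(y_true, y_pred, average=None, labels=labels)

print(f"UAR: {uar:.3f}")
print(dict(zip(labels, per_class)))
```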
“…Recently, deep learning approaches have been incorporated into speech emotion recognition and have brought performance improvements [9][10][11][12][13][14][15][16][17][24][25][26][27][28][29][30][31][32][33]. In [24], a deep neural network (DNN) classifier was adopted whose input consisted of acoustic features extracted from Mel-frequency spectral coefficients by a convolutional neural network (CNN)-long short-term memory (LSTM) network, other acoustic features obtained by a DNN from low-level descriptors (LLDs), and lexical features extracted as the outputs of CNNs applied to words and part-of-speech tags.…”
Section: Introduction (mentioning)
confidence: 99%
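As a rough illustration of the kind of hybrid acoustic-lexical classifier the excerpt describes, the PyTorch sketch below combines a CNN-LSTM encoder over Mel-spectral frames with a small lexical branch and fuses the two by concatenation. The layer sizes, the concatenation-based fusion, and the class count are assumptions made for illustration; this is not the exact architecture of the cited work [24].

```python
import torch
import torch.nn as nn

class AcousticLexicalSER(nn.Module):
    """Toy acoustic+lexical emotion classifier (illustrative only)."""

    def __init__(self, n_mels=40, lexical_dim=100, hidden=128, n_classes=4):
        super().__init__()
        # CNN over the Mel-frequency axis of the frame sequence.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # LSTM summarizes the frame sequence into an utterance vector.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        # Lexical branch: a small DNN over precomputed text features.
        self.lexical = nn.Sequential(nn.Linear(lexical_dim, hidden), nn.ReLU())
        # Fusion by concatenation, then a linear classifier.
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel, lexical_feats):
        # mel: (batch, n_mels, n_frames); lexical_feats: (batch, lexical_dim)
        h = self.conv(mel)                      # (batch, 64, n_frames)
        h = h.transpose(1, 2)                   # (batch, n_frames, 64)
        _, (h_n, _) = self.lstm(h)              # h_n: (1, batch, hidden)
        acoustic = h_n[-1]                      # (batch, hidden)
        lexical = self.lexical(lexical_feats)   # (batch, hidden)
        return self.classifier(torch.cat([acoustic, lexical], dim=1))

# Example usage with random tensors standing in for real features.
model = AcousticLexicalSER()
logits = model(torch.randn(2, 40, 300), torch.randn(2, 100))
print(logits.shape)  # torch.Size([2, 4])
```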
“…MFCC accounts for human perceptual sensitivity at different frequencies by converting conventional frequency to the Mel scale; thus, it is suitable for speech recognition tasks. Alternatively, we use a CNN-based audio generative model, called WaveNet [24], [25], which is pre-trained on the NSynth dataset [26] to produce a salient representation of audio signals. The feature dimension of the MFCC sequence is nT × C, where C is the number of coefficients, and the feature dimension of the WaveNet sequence is nT × W.…”
Section: A. Feature Extractions for Visual-Audio (mentioning)
confidence: 99%
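The MFCC half of the feature pipeline in the excerpt above can be sketched with librosa as follows; the number of coefficients C, the hop length, and the stand-in waveform are illustrative choices, and the WaveNet branch is omitted because it requires the NSynth-pre-trained encoder checkpoint.

```python
import numpy as np
import librosa

sr = 16000
# Stand-in waveform: 3 seconds of noise; in practice this would be an
# emotional-speech utterance loaded with librosa.load(path, sr=sr).
y = np.random.randn(3 * sr).astype(np.float32)

C = 13            # number of cepstral coefficients (illustrative choice)
hop_length = 160  # 10 ms hop at 16 kHz

# librosa returns MFCCs as (C, nT); transpose to the nT x C layout
# used in the excerpt above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=C, hop_length=hop_length).T

print(mfcc.shape)  # (nT, C): one C-dimensional vector per frame
```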
“…We determine feature representations for audio signals by computing the Mel-frequency cepstral coefficients (MFCC) [20]-[23]. In addition, we extract other audio representations using a CNN-based generative model, called WaveNet [24], [25], which was pre-trained on NSynth [26]. After extracting all representations for the visual and audio information, we feed these features to a deep graph fusion module to learn fused representations in a graph structure.…”
Section: Introduction (mentioning)
confidence: 99%
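A deep graph fusion module is beyond a short sketch, but the basic step of projecting the MFCC and WaveNet streams into a shared space before combining them can be illustrated as below. The projection-plus-concatenation here is a deliberate simplification of the graph-based fusion the excerpt describes, and all dimensions are assumed.

```python
import torch
import torch.nn as nn

class SimpleAudioFusion(nn.Module):
    """Projects two audio feature streams into a shared space and fuses
    them by mean-pooling over time and concatenation. This is a
    simplification; the cited work uses a deep graph fusion module."""

    def __init__(self, mfcc_dim=13, wavenet_dim=16, shared_dim=64):
        super().__init__()
        self.proj_mfcc = nn.Linear(mfcc_dim, shared_dim)
        self.proj_wavenet = nn.Linear(wavenet_dim, shared_dim)

    def forward(self, mfcc_seq, wavenet_seq):
        # mfcc_seq: (batch, nT, mfcc_dim); wavenet_seq: (batch, nT, wavenet_dim)
        m = self.proj_mfcc(mfcc_seq).mean(dim=1)        # (batch, shared_dim)
        w = self.proj_wavenet(wavenet_seq).mean(dim=1)  # (batch, shared_dim)
        return torch.cat([m, w], dim=1)                 # (batch, 2 * shared_dim)

# Example usage with random tensors standing in for the two feature streams.
fusion = SimpleAudioFusion()
fused = fusion(torch.randn(2, 100, 13), torch.randn(2, 100, 16))
print(fused.shape)  # torch.Size([2, 128])
```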
“…In [9], a comparison of the effectiveness of traditional features versus end-to-end learning in atypical affect and crying recognition is presented, only to conclude that there is no clear winner. Moreover, works in [10] and [11] have also utilized raw speech for emotion classification.…”
Section: Introduction (mentioning)
confidence: 99%