ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682541

Improving Speech Emotion Recognition with Unsupervised Representation Learning on Unlabeled Speech

Cited by 102 publications (80 citation statements)
References 19 publications
“…First, we observe that our model achieves the best classification accuracy in both validation and test cases among all models. To the best of our knowledge, the best results from the literature on IEMOCAP with similar settings are generally around 60% [32,25]. We achieve a UA of 59.48% which is comparable with the state-of-the-art results.…”
Section: Evaluation on IEMOCAP (supporting)
confidence: 75%
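The UA figure quoted above is the unweighted accuracy, i.e. recall averaged over the emotion classes irrespective of their sizes. A minimal sketch of that computation with scikit-learn, using purely illustrative labels rather than anything from the cited work:

```python
# Sketch: unweighted accuracy (UA) = mean of per-class recalls.
from sklearn.metrics import recall_score

# Illustrative predictions over the four IEMOCAP classes (not real results).
y_true = ["angry", "happy", "neutral", "sad", "sad", "happy"]
y_pred = ["angry", "neutral", "neutral", "sad", "happy", "happy"]

# average="macro" weights every class equally, which is what "unweighted"
# refers to on a class-imbalanced set such as IEMOCAP.
ua = recall_score(y_true, y_pred, average="macro")
print(f"UA = {ua:.4f}")
```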
“…Two datasets are employed to evaluate the proposed MEnAN-based emotion representation learning in our work: the IEMOCAP dataset [24] consists of five sessions of speech segments with categorical emotion annotation, and there are two different speakers (one female and one male) in each session. In our work, we use both improvised and scripted speech recordings and merge excitement with happy to achieve a more balanced label distribution, a common experimental setting in many studies such as [10,25,26]. Finally, we obtain 5,531 utterances selected from four emotion classes (1,103 angry, 1,636 happy, 1,708 neutral and 1,084 sad).…”
Section: Dataset (mentioning)
confidence: 99%
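The four-class IEMOCAP setup described in this excerpt (excitement folded into happy, all other categories discarded) is a data-preparation step that can be summarized in a short sketch. The function and variable names below are illustrative and not taken from the cited work:

```python
# Hypothetical sketch of the common four-class IEMOCAP label mapping:
# "excited" is merged into "happy"; only angry/happy/neutral/sad are kept.
from collections import Counter

KEEP = {"angry", "happy", "neutral", "sad"}

def map_label(raw_label: str):
    """Map a raw IEMOCAP annotation to one of the four target classes, or None."""
    label = "happy" if raw_label == "excited" else raw_label
    return label if label in KEEP else None  # drop all other categories

def build_four_class_subset(utterances):
    """utterances: iterable of (wav_path, raw_label) pairs loaded elsewhere."""
    subset = [(path, map_label(lab)) for path, lab in utterances]
    subset = [(path, lab) for path, lab in subset if lab is not None]
    # The excerpt reports 1,103 angry / 1,636 happy / 1,708 neutral / 1,084 sad.
    print(Counter(lab for _, lab in subset))
    return subset
```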
“…Because of the multi-faceted information included in the speech signal, transfer learning has been widely applied in speech-based emotion recognition (Table 1). Previously proposed approaches attempt to transfer the knowledge between datasets collected under similar conditions (e.g., audio signals collected by actors in the lab) (Abdelwahab and Busso, 2015, 2018; Sagha et al., 2016; Zhang et al., 2016; Deng et al., 2017; Gideon et al., 2017; Neumann and Vu, 2019) or using the knowledge from acted in-lab audio signals to spontaneous speech collected in-the-wild (Deng et al., 2014b; Mao et al., 2016; Zong et al., 2016; Song, 2017; Gideon et al., 2019; Li and Chaspari, 2019).…”
Section: Transfer Learning for Speech-Based Emotion Recognition (mentioning)
confidence: 99%
“…Different types of transfer learning architectures have been explored in speech-based emotion recognition, including the statistical methods (Deng et al., 2013, 2014a; Abdelwahab and Busso, 2015; Song et al., 2015; Sagha et al., 2016; Zong et al., 2016; Song, 2017), the adversarial or generative networks (Chang and Scherer, 2017; Abdelwahab and Busso, 2018; Gideon et al., 2019; Latif et al., 2019), and other neural network structures (Mao et al., 2016; Deng et al., 2017; Gideon et al., 2017; Li and Chaspari, 2019; Neumann and Vu, 2019; Zhou and Chen, 2019). A commonly used input of the aforementioned approaches is the feature set proposed by the INTERSPEECH emotion challenge and INTERSPEECH paralinguistic challenges (Schuller et al., 2009b), which typically contains the first 12 Mel Frequency Cepstral Coefficients, root-mean-square energy, zero-crossing rate, voice probability, and fundamental frequency (Deng et al., 2014b, 2017; Mao et al., 2016; Sagha et al., 2016; Zhang et al., 2016; Zong et al., 2016; Song, 2017; Abdelwahab and Busso, 2018; Li and Chaspari, 2019; Zhao et al., 2019).…”
Section: Transfer Learning for Speech-Based Emotion Recognition (mentioning)
confidence: 99%
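The low-level descriptors named in this excerpt (12 MFCCs, RMS energy, zero-crossing rate, voicing probability, F0) can be approximated at frame level with librosa. This is only a rough sketch in the spirit of the INTERSPEECH feature sets, not the official openSMILE configuration used in those challenges, and the function name is illustrative:

```python
# Rough frame-level approximation of the descriptors listed above.
import numpy as np
import librosa

def low_level_descriptors(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # (12, T)
    rms = librosa.feature.rms(y=y)                        # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)           # (1, T)
    f0, _, voiced_prob = librosa.pyin(
        y, sr=sr, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    f0 = np.nan_to_num(f0)[None, :]                       # (1, T), NaN -> 0 in unvoiced frames
    voiced_prob = voiced_prob[None, :]                     # (1, T)
    # Frame counts can differ by a frame or two across extractors; truncate to match.
    t = min(x.shape[1] for x in (mfcc, rms, zcr, f0, voiced_prob))
    return np.vstack([x[:, :t] for x in (mfcc, rms, zcr, f0, voiced_prob)])
```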
“…We used the TED Talks as a dataset to track the tension development. TED Talks are short conference presentations that introduce ideas on various topics in a few minutes, and the video part has been used for emotional analysis and assessment of engagement, exploiting the highly reliable English subtitles precisely synchronized to the video (Neumann and Vu, 2019; Haider et al., 2017). For the annotation of tension development, we have chosen to use TED Talks for two specific reasons: (1) Due to the nature of public lectures, many utterances raise the tension to keep the attention of the audience.…”
Section: Data (mentioning)
confidence: 99%