“…Different types of transfer learning architectures have been explored in speech-based emotion recognition, including statistical methods (Deng et al., 2013, 2014a; Abdelwahab and Busso, 2015; Song et al., 2015; Sagha et al., 2016; Zong et al., 2016; Song, 2017), adversarial or generative networks (Chang and Scherer, 2017; Abdelwahab and Busso, 2018; Gideon et al., 2019; Latif et al., 2019), and other neural network structures (Mao et al., 2016; Deng et al., 2017; Gideon et al., 2017; Li and Chaspari, 2019; Neumann and Vu, 2019; Zhou and Chen, 2019). A commonly used input to these approaches is the feature set proposed by the INTERSPEECH emotion challenge and INTERSPEECH paralinguistic challenges (Schuller et al., 2009b), which typically contains the first 12 Mel-frequency cepstral coefficients, root-mean-square energy, zero-crossing rate, voicing probability, and fundamental frequency (Deng et al., 2014b, 2017; Mao et al., 2016; Sagha et al., 2016; Zhang et al., 2016; Zong et al., 2016; Song, 2017; Abdelwahab and Busso, 2018; Li and Chaspari, 2019; Zhao et al., 2019).…”
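Two of the low-level descriptors named above, root-mean-square energy and zero-crossing rate, can be computed directly from framed audio. The following is a minimal NumPy sketch under assumed framing parameters (25 ms frames, 10 ms hop at 16 kHz); the function names and parameter values are illustrative and not taken from the cited challenge toolkits, which additionally extract MFCCs, voicing probability, and fundamental frequency.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Slice the waveform into overlapping frames
    # (assumed 25 ms frames with a 10 ms hop at 16 kHz).
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    # Root-mean-square energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    # Fraction of consecutive sample pairs whose sign changes, per frame.
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Illustrative input: 1 s of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x)
energy = rms_energy(frames)
zcr = zero_crossing_rate(frames)
```

For a pure 440 Hz sine, each 25 ms frame contains about 22 zero crossings, so the per-frame rate is roughly 0.055, and the RMS energy is close to 1/√2.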