Interspeech 2021
DOI: 10.21437/interspeech.2021-1840

Temporal Context in Speech Emotion Recognition

Cited by 19 publications (7 citation statements)
References 0 publications
“…However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [15,1,20] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. Particularly, as shown in Figure 1.a, we finetune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label.…”
Section: Pseudo-Label Task Adaptive Pretraining
confidence: 99%
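The segment-based objective described in the excerpt above can be illustrated with a short sketch: frame-level wav2vec 2.0 representations are each scored by a linear emotion classifier, and the utterance-level label is broadcast to every frame during training. This is a minimal illustration of the idea under stated assumptions, not the cited authors' exact recipe; the checkpoint name, number of emotion classes, and mean-pooled inference rule are illustrative choices.

```python
# Minimal sketch of segment-based training with only utterance-level labels.
# Assumptions (not from the cited works): the "facebook/wav2vec2-base" checkpoint,
# 4 emotion classes, and mean-pooling of frame posteriors at inference time.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class SegmentEmotionClassifier(nn.Module):
    def __init__(self, num_emotions: int = 4, model_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state   # (batch, frames, hidden)
        return self.head(frames)                            # (batch, frames, num_emotions)


model = SegmentEmotionClassifier()
criterion = nn.CrossEntropyLoss()

waveform = torch.randn(2, 16000)       # two 1-second dummy utterances
utt_labels = torch.tensor([1, 3])      # utterance-level emotion labels

frame_logits = model(waveform)         # (2, T, 4)
# Segment-based objective: broadcast the utterance label to every frame.
frame_labels = utt_labels.unsqueeze(1).expand(-1, frame_logits.size(1))
loss = criterion(frame_logits.reshape(-1, 4), frame_labels.reshape(-1))
loss.backward()

# At inference, averaging frame-level posteriors gives an utterance-level prediction,
# while the per-frame scores expose the frame-level emotion information the
# citing work's pseudo-labeling step builds on.
utt_pred = frame_logits.softmax(-1).mean(dim=1).argmax(-1)
```

Broadcasting the utterance label to all frames is one common way to realize a segment-based objective; the citing papers may weight or select segments differently, but the frame-level classifier head is the part that yields per-frame emotion evidence.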
“…Method | Input | Accuracy:
FCN+Attention [3] | Spectrogram | 63.9
Wav2vec w/o. FT [14] | Wav2vec | 64.3
Wav2vec w. FT [15] | Waveform | 66.9
Wav2vec 2.0 w/o. FT [16] | Wav2vec 2.0 | 66.3
Wav2vec 2.0 w. V-FT | Waveform | 69.9
Wav2vec 2.0 w. TAPT | Waveform | 73.5
Wav2vec 2.0 w. P-TAPT | Waveform | …”
Section: Comparison With Prior Work
confidence: 99%