Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition

Wang, Shijun; Hemati, Hamed; Guðnason, Jón; Borth, Damian

doi:10.21437/interspeech.2022-10667

Cited by 2 publications

(3 citation statements)

References 38 publications

(62 reference statements)


“…Generating spectrograms or raw waveforms provides more flexibility by allowing us to train models directly on the raw data. Chatziagapi et al [4] and Wang et al [5] proposed generating mel spectrograms using GANs to tackle data imbalance by augmenting the minority classes. Similarly Eskimez et al [16] used an improved version of GANs with higher generation quality to apply SER data augmentation using spectrograms.…”

Section: Related Work (mentioning)

confidence: 99%

“…Synthetic data is artificially generated data, which can be used to replace or augment real data in training deep learning models. Such approach has multiple advantages in terms of data privacy and security [3], balancing skewed datasets [4,5], as well as overcoming the lack of large datasets, as the case with SER [6]. The quality and realism of synthetic data is critical for its effectiveness in deep learning applications.…”

Section: Introduction (mentioning)

confidence: 99%


Towards Improving Speech Emotion Recognition Using Synthetic Data Augmentation from Emotion Conversion

Ibrahim, Perzo, Leglaive. 2024

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


One of the main challenges in speech emotion recognition is the lack of large labelled datasets. The progress in speech synthesis allows us to generate reliable and realistic expressive speech. In this work, we propose using a state-of-the-art end-to-end speech emotion conversion model to generate new synthetic data for training speech emotion recognition models. We first evaluate the quality of the converted speech on new unseen datasets, which proves to be on par with the training data. Then, we study the effect of using the synthesized speech as data augmentation. We show that this approach improves the overall performance of emotion recognition models on two different datasets, IEMOCAP and RAVDESS, both in the cases of speaker dependent and independent emotion recognition using a fine-tuned wav2vec 2.0.



“…Data augmentation is an effective method to solve this problem [16]. For example, generative adaptive networks (GANs) and its variants are often applied to generate new samples [17][18][19]. Alternatively, a larger data can be directly constructed from existing data with hand-crafted features [20].…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal and Multitask Learning with Additive Angular Penalty Focus Loss for Speech Emotion Recognition

Wen, Ye, et al. 2023

International Journal of Intelligent Systems


Speech emotion recognition has lots of applications such as human-computer interaction and health management. The current methods are challenged with the problems of fuzzy decision boundary and imbalance between difficult and easy samples in the training data. This paper first proposes an additive angle penalty focus loss function (APFL), which strictly refines the fuzzy decision boundary by introducing angle penalty factors to improve the compactness within the class and enlarge the distance between classes. It also assigns the larger loss to difficult samples to make the model pay more attention to them, as they are easily misclassified. Simultaneously, due to the lack of training samples, the framework of multimodal and multitask learning with APFL is further proposed, which extracts spectrogram features by deep neural network, text features by the pretrained language model, and audio features by the pretrained sound model. It uses the gender recognition as an auxiliary task. The experimental results verify the effectiveness of the proposed loss function and framework.

