Interspeech 2018
DOI: 10.21437/interspeech.2018-2508
Role of Regularization in the Prediction of Valence from Speech

Abstract: Regularization plays a key role in improving the prediction of emotions using attributes such as arousal, valence and dominance. Regularization is particularly important with deep neural networks (DNNs), which have millions of parameters. While previous studies have reported competitive performance for arousal and dominance, the prediction results for valence using acoustic features are significantly lower. We hypothesize that higher regularization can lead to better results for valence. This study focuses on …
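One common form of the heavy regularization the abstract refers to is dropout. A minimal sketch of inverted dropout in NumPy (the rate of 0.7 and the helper name `dropout` are illustrative, not from the paper):

```python
import numpy as np

def dropout(x, rate=0.7, rng=None):
    """Inverted dropout: zero a fraction `rate` of activations and rescale
    the survivors by 1/(1 - rate) so the expected activation is unchanged.
    A rate of 0.7 illustrates the 'heavy' regularization hypothesized to
    help valence prediction."""
    rng = rng or np.random.default_rng(0)
    # Boolean keep-mask divided by the keep probability -> 0 or 1/(1-rate)
    mask = (rng.random(x.shape) >= rate) / (1.0 - rate)
    return x * mask

h = np.ones((4, 8))        # toy hidden-layer activations
out = dropout(h, rate=0.7)  # each entry is either 0 or 1/0.3
```

At test time no units are dropped; the inverted scaling during training means no extra rescaling is needed at inference.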

Cited by 20 publications (17 citation statements)
References 25 publications
“…The corresponding relative improvements observed for arousal and dominance were less than 4%. These results showed that valence emotional cues include more speaker-dependent traits, explaining why heavily regularizing a DNN helps it learn more general emotional cues across speakers [15]. Building on these results, we propose an unsupervised personalization approach that is extremely useful in the prediction of valence.…”
Section: Introduction
confidence: 70%
“…Although arousal and dominance have similar accuracy with dynamicOverlap (i.e., the differences are not statistically significant), our proposed Self-AttenVec method achieved the best valence CCC result (CCC=0.3337). Valence is an attribute that is particularly challenging to predict with acoustic features [52], [53], indicating that complete sentence-level information can bring complementary benefits for more complex tasks. The advantage of applying attention models is amplified in the CNN and functional models.…”
Section: Proposed Chunk-level SER Results
confidence: 99%
“…The Adam optimizer [27] and an exponentially decaying learning rate (initial rate 1e-3, decay rate 0.93 per epoch, final rate 5e-5) are used to optimize the parameters. For regularization, dropout with rate 0.7, as suggested in [28], is applied to the output of the encoder; l1 and l2 regularization with weight 5e-3 are used for training on RECOLA and IEMOCAP, respectively. We train the models for 50 epochs with a batch size of 32, and 30% of the data from the test set is used as the development set for early stopping.…”
Section: Methods
confidence: 99%
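The learning-rate schedule quoted above can be sketched in plain Python. One assumption is made here: the quoted "final rate 5e-5" is treated as a floor the decayed rate cannot drop below, which matches the rate reached near epoch 50 of training.

```python
def lr_schedule(epoch, initial=1e-3, decay=0.93, final=5e-5):
    """Per-epoch exponential decay with a floor, matching the quoted setup:
    start at 1e-3, multiply by 0.93 each epoch, never drop below 5e-5."""
    return max(initial * decay ** epoch, final)

# Rates over the quoted 50-epoch training run
rates = [lr_schedule(e) for e in range(50)]
```

With these constants the decayed rate falls below 5e-5 around epoch 42, so the last few epochs train at the floor rate.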