Interspeech 2017
DOI: 10.21437/interspeech.2017-736

Towards Speech Emotion Recognition “in the Wild” Using Aggregated Corpora and Deep Multi-Task Learning

Abstract: One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) and use gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". In comparison to Single-Task Learn…
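The multi-task setup the abstract describes (a shared network with an emotion head plus gender and naturalness auxiliary heads) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's actual architecture: the feature dimension, hidden size, class counts, and the auxiliary-task weight are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 88 acoustic features, one shared hidden
# layer, and three task-specific output heads.
n_feat, n_hidden = 88, 32
W_shared = rng.normal(0, 0.1, (n_feat, n_hidden))
heads = {
    "emotion":     rng.normal(0, 0.1, (n_hidden, 4)),  # main task (4 classes, illustrative)
    "gender":      rng.normal(0, 0.1, (n_hidden, 2)),  # auxiliary task
    "naturalness": rng.normal(0, 0.1, (n_hidden, 2)),  # auxiliary task: acted vs. natural
}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    # One shared representation feeds every task head.
    h = np.tanh(x @ W_shared)
    return {task: softmax(h @ W) for task, W in heads.items()}

def mtl_loss(probs, targets, aux_weight=0.1):
    # Cross-entropy on the main task plus down-weighted auxiliary losses.
    ce = lambda p, y: -np.log(p[np.arange(len(y)), y]).mean()
    loss = ce(probs["emotion"], targets["emotion"])
    for task in ("gender", "naturalness"):
        loss += aux_weight * ce(probs[task], targets[task])
    return loss

x = rng.normal(size=(16, n_feat))
targets = {"emotion": rng.integers(0, 4, 16),
           "gender": rng.integers(0, 2, 16),
           "naturalness": rng.integers(0, 2, 16)}
loss = mtl_loss(forward(x), targets)
print(f"MTL loss: {loss:.3f}")
```

Because the auxiliary gradients flow through the same shared weights, the representation is pushed to encode speaker and elicitation attributes alongside emotion, which is the intended regularisation effect.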

Cited by 66 publications (48 citation statements). References 25 publications.
“…Emotion is only one of several factors that impacts the acoustics of speech. Some factors that change across datasets and can impact affect recognition include the environmental noise [4], the spoken language [3], the recording device quality [5], and the elicitation strategy (acted versus natural) [6]. Additionally, a mismatch in subject demographics between datasets can result in misclassification, due to the small numbers of participants common in speech emotion recognition datasets [2].…”
Section: Introduction
confidence: 99%
“…Most modern techniques of cross-corpus speech emotion recognition use deep learning to build representations over low-level acoustic features. Many of these techniques incorporate tasks in addition to emotion in order to learn more robust representations [6], [10].…”
Section: Introduction
confidence: 99%
“…In the first experiment, we discuss the influence of different feature selection and prediction manners. Since multi-task learning has demonstrated its performance in [6,20,21], we treat multi-task learning as the comparison approach in the second experiment. In the last experiment, we show the advantages of adding the contrastive loss during the training phase.…”
Section: Evaluation Results
confidence: 99%
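The contrastive loss mentioned in the excerpt above can be illustrated with the classic pairwise formulation; this is a generic NumPy sketch, not necessarily the exact loss used in that work, and the margin value is an illustrative assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Pairwise contrastive loss: pull same-class embedding pairs
    together, push different-class pairs at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)          # pairwise distances
    pos = same_label * d**2                            # attract same-class pairs
    neg = (1 - same_label) * np.maximum(margin - d, 0.0)**2  # repel different-class pairs
    return 0.5 * (pos + neg).mean()

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 16))
b = a + 0.01 * rng.normal(size=(8, 16))  # near-duplicate embeddings
print(contrastive_loss(a, b, np.ones(8)))   # small: same-class pairs are close
```

During training, such a term encourages utterances with the same emotion label to cluster in the embedding space regardless of corpus, which complements the standard classification loss.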
“…DAE and ladder structures pay less attention to modeling long-term dynamic dependencies. However, temporal information is important for speech emotion recognition [17,18]. Therefore, our self-attention based FOP model, which can capture long-term dynamic dependencies, is superior to other currently advanced unsupervised learning strategies.…”
Section: Comparison To Other Advanced Approaches
confidence: 99%
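The excerpt above argues that self-attention captures long-term dynamic dependencies. A minimal single-head, unparameterised scaled dot-product self-attention sketch makes the point: every frame attends to every other frame, so its output can depend on acoustic context arbitrarily far away. Real models such as the cited FOP architecture add learned projections and multiple heads; this stripped-down version is for illustration only.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a (T, d) frame sequence,
    without learned query/key/value projections."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                       # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # rows are attention distributions
    return weights @ X, weights                         # context vectors, attention map

X = np.random.default_rng(2).normal(size=(50, 8))       # 50 frames, 8-dim features
out, attn = self_attention(X)
```

Note that frame 0 receives a non-zero weight for frame 49, which is exactly the long-range dependency that DAE and ladder structures, operating frame-locally, do not model.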
“…These methods [12,13] take the whole input into account, aiming to learn intermediate feature representations that can reconstruct the input. However, they pay less attention to modeling the long-term dynamic dependency, which is important for speech emotion recognition [17,18].…”
Section: Introduction
confidence: 99%