Interspeech 2017
DOI: 10.21437/interspeech.2017-548

Capturing Long-Term Temporal Dependencies with Convolutional Networks for Continuous Emotion Recognition

Abstract: The goal of continuous emotion recognition is to assign an emotion value to every frame in a sequence of acoustic features. We show that incorporating long-term temporal dependencies is critical for continuous emotion recognition tasks. To this end, we first investigate architectures that use dilated convolutions. We show that even though such architectures outperform previously reported systems, the output signals produced from such architectures undergo erratic changes between consecutive time steps. This is…
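The dilated-convolution idea from the abstract can be illustrated with a minimal numpy sketch. This is illustrative only: the kernel size (3), the dilation schedule (1, 2, 4, 8), and the averaging weights are assumptions for the example, not the paper's architecture.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D dilated convolution over the valid region of signal x."""
    k = len(w)
    span = (k - 1) * dilation          # input span covered by one output sample
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        taps = x[t : t + span + 1 : dilation]  # taps spaced `dilation` apart
        out[t] = np.dot(taps, w)
    return out

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive
# field exponentially with depth: 1 + (3-1)*(1+2+4+8) = 31 samples.
x = np.random.randn(100)
w = np.ones(3) / 3.0                   # kernel size 3, simple averaging weights
y = x
for d in [1, 2, 4, 8]:
    y = dilated_conv1d(y, w, d)
print(len(y))  # 70: each layer shortens the signal by (3-1)*dilation
```

Each doubling of the dilation lets a fixed-size kernel see exponentially more context, which is how such stacks capture long-term dependencies without deep pooling.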


Cited by 30 publications (34 citation statements)
References 20 publications (36 reference statements)
“…Prior work has demonstrated the importance of considering long-term context when predicting valence (the same effect has not been shown in activation) [30]. The contextual annotations provided the annotators with this information, but the classifier could not take advantage of this effect.…”
Section: Question
confidence: 97%
“…For this reason, we select hyperparameters based on those found to be commonly selected in prior work and keep them constant for all experiments. A channel size of 128 is used for all convolutional and fully connected layers, as commonly selected in prior work [24], [52]. ReLU is used as the activation function for all but the final layer, as it has been shown to be successful in the field and is computationally efficient [24], [53].…”
Section: CNN
confidence: 99%
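The design choices quoted above (128 channels everywhere, ReLU on every layer except the last) can be sketched with a toy numpy forward pass. A dense stack stands in for the convolutional layers here, and the weight initialization and depth are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
channels = 128  # channel size from the cited hyperparameter choice
# three hidden layers plus a final linear regression layer (hypothetical depth)
layers = [rng.standard_normal((channels, channels)) * 0.01 for _ in range(3)]
w_out = rng.standard_normal((channels, 1)) * 0.01

def forward(x):
    h = x
    for w in layers:
        h = relu(h @ w)   # ReLU on every hidden layer
    return h @ w_out      # final layer stays linear for regression output

x = rng.standard_normal((1, channels))
print(forward(x).shape)  # (1, 1): one continuous emotion value per input
```

Leaving the final layer linear matters for continuous emotion recognition: a ReLU output could never predict negative valence or activation values.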
“…Stress has been shown to have varying effects on both the linguistic [5] and para-linguistic [37,41] components of communication. Previous work has also demonstrated that the lexical part of speech carries more information about valence, while the para-linguistic part carries more information about activation [22]. As a result, we expect the performance of stress classification to vary based on the modality and the emotion dimension being modeled.…”
Section: Question
confidence: 90%
“…Acoustic. We use Mel Filterbank (MFB) features, which are frequently used in speech processing applications, including speech recognition and emotion recognition [22,26]. We extract the 40-dimensional MFB features using a 25-millisecond Hamming window with a step-size of 10 milliseconds.…”
Section: Features
confidence: 99%
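The framing step quoted above (25 ms Hamming window, 10 ms step) can be sketched in numpy. The 16 kHz sample rate is an assumption for illustration; the 40-dimensional MFB features would then come from applying a mel filterbank to each frame's power spectrum, which is omitted here:

```python
import numpy as np

sr = 16000                # assumed sample rate (not stated in the excerpt)
win = int(0.025 * sr)     # 25 ms window -> 400 samples
hop = int(0.010 * sr)     # 10 ms step  -> 160 samples
hamming = np.hamming(win)

def frame_signal(x, win, hop):
    """Slice a 1-D signal into overlapping frames, dropping the tail."""
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

x = np.random.randn(sr)                     # 1 second of audio
frames = frame_signal(x, win, hop) * hamming  # windowed frames
print(frames.shape)  # (98, 400)
```

With a 10 ms hop, each second of audio yields roughly 100 frames, which is the per-frame rate at which continuous emotion values are predicted.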