Proceedings of the 25th ACM International Conference on Multimedia 2017
DOI: 10.1145/3123266.3123353

Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition

Abstract: Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient for their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a n…
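For readers unfamiliar with the term, an identity skip-connection adds a layer's input back to its output unchanged, so a block computes y = x + F(x). The sketch below is a minimal illustration of that idea applied frame by frame to a sequence of feature vectors. It is not the architecture proposed in the paper (the abstract is truncated here); the helper names `dense_tanh`, `identity_skip`, and `apply_over_time` are hypothetical, and the tanh branch is an assumed stand-in for whatever transformation F a real model would use.

```python
import math
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]


def dense_tanh(weights: List[Vector], bias: Vector) -> Branch:
    """Build a hypothetical residual branch F(x) = tanh(Wx + b)."""
    def branch(x: Vector) -> Vector:
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)
        ]
    return branch


def identity_skip(x: Vector, branch: Branch) -> Vector:
    """Compute y = x + F(x); the input is carried forward unchanged."""
    fx = branch(x)
    if len(fx) != len(x):
        # An identity shortcut only works when F preserves the dimension.
        raise ValueError("F(x) must have the same size as x")
    return [xi + fi for xi, fi in zip(x, fx)]


def apply_over_time(frames: List[Vector], branch: Branch) -> List[Vector]:
    """Apply the same skip-connected block to every frame of a sequence."""
    return [identity_skip(frame, branch) for frame in frames]


if __name__ == "__main__":
    # Assumed toy weights; with W = 0 and b = 0 the block is the identity.
    zero_branch = dense_tanh([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
    print(apply_over_time([[1.0, -2.0], [0.5, 0.25]], zero_branch))
```

When the branch outputs zeros, the block passes its input through untouched, which is one intuition for why stacking such blocks does not have to degrade the signal as depth grows.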

Cited by 32 publications (17 citation statements)
References 32 publications
“…The network layers extract abstract representations and also filter out the irrelevant information which leads to a more accurate classification [6,7] and better generalisation [8,9]. Temporal models were also proposed for modelling sequential data with mid to long-term dependencies [10,11].…”
Section: Introduction (mentioning)
confidence: 99%
“…A CNN-LSTM model taking spectrograms as input was also proposed in [1]. CNNs with an attention mechanism were investigated in [21] and [2] proposed an architecture composed of convolutional highway networks and LSTMs. SER models trained on a single corpus tend to overfit, leading to poor performance on out-of-domain data, as presented in [3].…”
Section: Related Work (mentioning)
confidence: 99%
“…Deep learning architectures, such as Convolutional Neural Networks (CNNs) [1] and highway networks [2], have been shown to yield state-of-the-art performance on this task. However, being able to use these models "in the wild" (i.e.…”
Section: Introduction (mentioning)
confidence: 99%
“…A multimodal system was built with a late fusion approach. On one side, the spectrograms of the audio cue were used to train a system similar to the one described in (Kim et al., 2017b). As shown in Figure 2, a convolutional neural network (CNN) was used to extract descriptors from the data.…”
Section: Multimodal Emotion Recognition (mentioning)
confidence: 99%
“…A first step towards a fully functional listening agent is to first build a recognition module that would feed a prediction system such as the one mentioned above. For this we utilize both audio and visual signals for emotion recognition using machine learning techniques, such as temporal (Kim & Provost, 2016) or deep learning models (Kim, Provost, & Lee, 2013), (Kim, 2017a, 2017b).…”
Section: Introduction (mentioning)
confidence: 99%