Interspeech 2018
DOI: 10.21437/interspeech.2018-1610
Attention-based Sequence Classification for Affect Detection

Abstract: This paper presents the Cogito submission to the Interspeech Computational Paralinguistics Challenge (ComParE), for the second sub-challenge. The aim of this second sub-challenge is to recognize self-assessed affect from short clips of speech-containing audio data. We adopt a sequence classification-based approach where we use a long short-term memory (LSTM) network for modeling the evolution of low-level spectral coefficients, with an added attention mechanism to emphasize salient regions of the audio clip. Addit…
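The abstract's core mechanism — soft attention pooling over frame-level LSTM outputs to emphasize salient regions of the clip — can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation; the parameter names `w`, `b`, and `u` are hypothetical.

```python
import numpy as np

def attention_pool(frames, w, b, u):
    """Soft attention pooling over a sequence of frame-level features.

    frames: (T, d) per-frame features (e.g. LSTM hidden states over
            low-level spectral coefficients).
    Returns a (d,) weighted summary and the (T,) attention weights.
    """
    scores = np.tanh(frames @ w + b) @ u   # (T,) unnormalized frame scores
    alpha = np.exp(scores - scores.max())  # stable softmax
    alpha = alpha / alpha.sum()            # weights sum to 1
    summary = alpha @ frames               # attention-weighted average
    return summary, alpha

# Toy usage: 50 frames of 8-dim features with random projection parameters.
rng = np.random.default_rng(0)
T, d = 50, 8
frames = rng.standard_normal((T, d))
w, b, u = rng.standard_normal((d, d)), rng.standard_normal(d), rng.standard_normal(d)
summary, alpha = attention_pool(frames, w, b, u)
```

The weighted summary would then feed a classifier head; frames with larger scores dominate the clip-level representation, which is the sense in which attention "emphasizes salient regions."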

Cited by 13 publications (9 citation statements) · References 11 publications (13 reference statements)
“…Recurrent Stage: Gated recurrent units (GRU) and long short-term memory units (LSTM) [54] are the two most common recurrent types in paralinguistics [9], [18], [19], [55], [56]. Unidirectional [9], [18] as well as bidirectional [19], [56] networks are popular.…”
Section: Hyperparameter Search Space
confidence: 99%
“…However, as these implementations do not allow altering the activation function or implementing recurrent batch normalization [57], we fixed the corresponding parameter ranges to the implementations' preset values. When using the recurrent stage as the first one in the network, we scanned for unit numbers of up to 128, as RNNs are commonly shallower and wider than CNNs [18], [55], [58].…”
Section: Hyperparameter Search Space
confidence: 99%
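The hyperparameter scan described in these excerpts — choosing a recurrent cell type, directionality, and a unit count of up to 128 — could be sampled as below. This is a hypothetical sketch of one random-search draw under the stated search space, not the cited authors' code.

```python
import random

def sample_recurrent_config(rng=random):
    """Draw one recurrent-stage configuration from the search space
    described above (GRU vs. LSTM, uni- vs. bidirectional, up to 128 units).
    """
    return {
        "cell": rng.choice(["gru", "lstm"]),
        "bidirectional": rng.choice([True, False]),
        "units": rng.choice([16, 32, 64, 128]),
    }

cfg = sample_recurrent_config()
```

Each draw would then be trained and scored, with the best configuration retained.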
“…When combined with a self-attention mechanism, emotionally informative time-segments of an input can be highlighted [60]. Mirsamadi et al used an attention RNN for SER on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, while Gorrostieta et al applied a similar model with low-level spectral features as input for the ComParE self-assessed affect [61] sub-challenge [62]. More recently, combining CNN feature extractors with attention-based RNNs has been shown to be a highly competitive approach to SER [63], [64].…”
Section: Deep Learning Based SER
confidence: 99%
“…In addition to learning useful spatio-temporal features, it is also important to select the emotionally salient sections of an input signal to improve SER performance further [11]. The use of attention mechanisms in RNN and CNN-based models has frequently been demonstrated as a useful tool to encourage a model to more heavily weight specific regions of an input sequence or image [12]. Attention mechanisms have also been effectively applied in SER [11], [13]- [15].…”
Section: Introduction
confidence: 99%