2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639633

Context-Aware Attention Mechanism for Speech Emotion Recognition

Abstract: In this work, we study the use of attention mechanisms to enhance the performance of the state-of-the-art deep learning model in Speech Emotion Recognition (SER). We introduce a new Long Short-Term Memory (LSTM)-based neural network attention model which is able to take into account the temporal information in speech during the computation of the attention vector. The proposed LSTM-based model is evaluated on the IEMOCAP dataset using a 5-fold cross-validation scheme and achieves 68.8% weighted accuracy on 4 c…
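For readers unfamiliar with attention pooling over recurrent encoders, the sketch below illustrates the general idea the abstract refers to: frame-level features are encoded by a (Bi)LSTM and a learned attention weighting aggregates the time steps before 4-class classification. This is a generic, hedged illustration in PyTorch (an assumption), not the paper's exact context-aware LSTM attention model; all layer sizes and the feature dimension are illustrative.

```python
# Hedged sketch (PyTorch assumed): a BiLSTM encoder with additive attention
# pooling over frame-level features, followed by a 4-class emotion classifier.
# Hyper-parameters are illustrative, not the authors' configuration.
import torch
import torch.nn as nn

class AttentiveLSTMSER(nn.Module):
    def __init__(self, n_features=32, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        # Additive (Bahdanau-style) scoring of each time step.
        self.score = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                    # x: (batch, time, n_features)
        h, _ = self.lstm(x)                  # h: (batch, time, 2*hidden)
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (batch, time)
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # weighted sum
        return self.classifier(context)      # logits over the 4 emotions
```

A forward pass on a batch of 8 utterances of 300 frames with 32-dimensional features, `AttentiveLSTMSER()(torch.randn(8, 300, 32))`, returns logits of shape (8, 4).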

Cited by 42 publications (43 citation statements)
References 12 publications
“…We investigate the proposed method using a simple LSTM model and a small-size Transformer model on the IEMOCAP dataset (Busso et al., 2008), composed of five acted sessions, for a four-class emotion classification, and we compare to the state-of-the-art model of Mirsamadi et al. (2017), a local-attention-based BiLSTM. Ramet et al. (2018) showed in their work a new model that is competitive with the one previously cited, following a cross-validation evaluation scheme. For a fair comparison, in this paper we focus on a non-cross-validation scheme and thus compare our results to the work of Mirsamadi et al. (2017), where a similar scheme is followed, using the fifth session of the IEMOCAP database as evaluation set.…”
Section: Related Work
confidence: 97%
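The excerpt above contrasts a 5-fold cross-validation protocol with a fixed hold-out protocol that keeps the fifth IEMOCAP session for evaluation. A minimal sketch of that split follows; it assumes the standard IEMOCAP utterance naming, where IDs start with the session tag (Ses01 … Ses05), which is not stated in the excerpt itself.

```python
# Minimal sketch of the session-based hold-out split described above.
# Assumes IEMOCAP utterance IDs begin with their session tag, e.g.
# "Ses05F_impro03_M012"; adjust if your metadata is organised differently.
def split_by_session(utterance_ids, eval_session="Ses05"):
    train = [u for u in utterance_ids if not u.startswith(eval_session)]
    test = [u for u in utterance_ids if u.startswith(eval_session)]
    return train, test
```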
“…OpenSMILE (Eyben et al., 2013) is used for extracting the features. We opt for the IS09 feature set (Schuller et al., 2009), as proposed by Ramet et al. (2018) and commonly used for SER.…”
Section: Toi in Speech Emotion Recognition
confidence: 99%
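As a concrete illustration of the feature-extraction step mentioned above, the following hedged sketch calls the openSMILE command-line extractor with the IS09 emotion configuration. The config path shown is the one shipped with recent openSMILE releases, but it varies between versions, so treat it as an assumption and point it at your local install.

```python
# Hedged sketch: extract the IS09 emotion feature set (384 functionals per
# utterance) with the openSMILE command-line tool. The config file location
# (config/is09-13/IS09_emotion.conf) is an assumption that depends on the
# installed openSMILE version; SMILExtract must be on the PATH.
import subprocess

def extract_is09(wav_path, out_arff,
                 config="config/is09-13/IS09_emotion.conf"):
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_arff],
        check=True,
    )

# Example: extract_is09("Ses01F_impro01_F000.wav", "is09_features.arff")
```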
“…The relatively small amount of training data in our case, only 5.5 hours of speech, could lead to a partial learning of the input representation. As for the engineered features, we evaluated our methodology on the IS09 [17] feature set (384 features) because it is a common set used for SER tasks and it has been used by [5] to obtain the latest state-of-the-art results. Even though it has not been used as extensively as IS09, we also extracted the eGeMAPS set [18]: this set has proven to be a good substitute for IS09 in several works, such as [19], [20] and [21].…”
Section: Input Features
confidence: 99%
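The excerpt above also mentions eGeMAPS as a substitute for IS09. A hedged sketch using the `opensmile` Python package (an assumption about tooling, not necessarily what the cited authors used) to obtain the eGeMAPS functionals:

```python
# Hedged sketch: eGeMAPS functionals via the audEERING `opensmile` Python
# package (pip install opensmile). eGeMAPSv02 yields 88 functionals per file.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("Ses01F_impro01_F000.wav")  # 1 x 88 DataFrame
```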
“…The IEMOCAP database [23] was chosen for our experiments since it is established as a benchmark in the SER literature. Moreover, it contains audio recorded at a relatively high sample rate (16 kHz), both genders, 9 emotions, and both improvised and scripted speech, which the literature has shown to differ in complexity at inference time [4], [5]. Out of the 9 emotions we focused on four (angry, happy, neutral and sad) in order to have results comparable with previous research.…”
Section: Database
confidence: 99%
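To make the four-class setup described above concrete, the sketch below filters IEMOCAP annotations down to the angry/happy/neutral/sad subset. The abbreviated label strings are an assumption based on the common IEMOCAP annotation codes, and whether "excited" is merged into "happy" varies between studies and is not specified in the excerpt.

```python
# Hedged sketch: keep only the four target emotions from IEMOCAP annotations.
# The abbreviated label codes ("ang", "hap", "neu", "sad") are an assumption.
TARGET_LABELS = {"ang": "angry", "hap": "happy", "neu": "neutral", "sad": "sad"}

def filter_four_class(samples):
    """samples: iterable of (utterance_id, label) pairs."""
    return [(uid, TARGET_LABELS[lab]) for uid, lab in samples
            if lab in TARGET_LABELS]
```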