2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)
DOI: 10.1109/acii.2017.8273599
Segment-based speech emotion recognition using recurrent neural networks

Cited by 58 publications (36 citation statements). References 18 publications.
“…The same applies for LR (in WA from 0.8% to 1.0% and in UA from 0.3% to 1.0%) as well as for A-BLSTM (in WA from 0.1% to 0.7% and in UA from 0.2% to 0.7%). In accordance with our intuition [8], a segment-based approach using A-BLSTM surpasses all utterance-based ones in WA from 3.4% to 8.4% and in UA from 3.8% to 6.8% for all normalization schemes, when the fused set is used.…”
Section: Leave One Session Out (LOSO) (supporting)
confidence: 86%
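The WA and UA figures in the quotation above are the two accuracy metrics conventionally reported for SER: weighted accuracy is the overall fraction of correctly classified utterances, while unweighted accuracy is the recall averaged over emotion classes. A minimal sketch of how they are typically computed follows; the function name and toy labels are illustrative, not taken from the cited work.

import numpy as np

def wa_ua(y_true, y_pred, num_classes):
    # Weighted accuracy (WA): overall fraction of correct predictions.
    # Unweighted accuracy (UA): per-class recall averaged over classes,
    # so each emotion counts equally regardless of how frequent it is.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(num_classes) if np.any(y_true == c)]
    ua = float(np.mean(recalls))
    return wa, ua

# Toy usage with four emotion classes.
wa, ua = wa_ua([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 3, 0], num_classes=4)
print(f"WA={wa:.3f}  UA={ua:.3f}")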
“…Moreover, segment-based approaches have showcased that computation of statistical functionals over LLDs in appropriate timescales yields a significant performance improvement for SER systems [7], [8]. Specifically, in [8] statistical representations are extracted from overlapping segments, each one corresponding to a couple of words. The resulting sequence of segment representations is fed as input to a Long Short-Term Memory (LSTM) unit for SER classification.…”
Section: Introduction (mentioning)
confidence: 99%
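To make the pipeline sketched in this quotation more concrete, the following is a minimal illustration, assuming frame-level LLDs are already available as a NumPy array: statistical functionals are pooled over overlapping segments, and the segment sequence is fed to an LSTM classifier. Segment length, hop, functional set, and layer sizes are illustrative assumptions, not the configuration used in [8].

import numpy as np
import torch
import torch.nn as nn

def segment_functionals(lld_frames, seg_len=100, hop=50):
    # Pool frame-level LLDs (num_frames x num_llds) into statistical
    # functionals over overlapping segments; each segment yields
    # mean, std, min, max per LLD.  seg_len and hop are illustrative.
    segments = []
    for start in range(0, max(1, len(lld_frames) - seg_len + 1), hop):
        seg = lld_frames[start:start + seg_len]
        feats = np.concatenate([seg.mean(axis=0), seg.std(axis=0),
                                seg.min(axis=0), seg.max(axis=0)])
        segments.append(feats)
    return np.stack(segments)          # (num_segments, 4 * num_llds)

class SegmentLSTM(nn.Module):
    # LSTM over the sequence of segment representations; the last
    # hidden state is mapped to emotion-class logits.
    def __init__(self, input_dim, hidden_dim=128, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):              # x: (batch, num_segments, input_dim)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])

# Toy usage: 500 frames of 30 hypothetical LLDs for one utterance.
llds = np.random.randn(500, 30).astype(np.float32)
seq = torch.from_numpy(segment_functionals(llds)).unsqueeze(0)
logits = SegmentLSTM(input_dim=seq.shape[-1])(seq)
print(logits.shape)                    # torch.Size([1, 4])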
“…The proposed model outperforms the state-of-the-art models on both the improvised partition and the full IEMOCAP dataset, in terms of WA and UA. Results from [11,12,13,14] were rounded to one decimal digit.…”
Section: Results (mentioning)
confidence: 99%
“…Tzinis and Potamianos [4] ran a study on both local and global features and evaluated the performance at various time-scales (frame, phoneme, word, or utterance). The results show that global statistical features extracted from speech segments corresponding to the duration of a few words yield optimal accuracy using Recurrent Neural Networks (RNNs).…”
Section: Related Work (mentioning)
confidence: 99%
“…The appropriate time-scale selection is crucial to produce a high-performance SER system. Emotional features can be categorized into two types of time scale: 1) Low-Level Descriptors (LLDs), known as local features, and 2) statistical functionals, known as global features [4]. Local features capture the temporal dynamics of the prosody, while statistical values such as the minimum, maximum, mean, standard deviation, and slope of the contours describe the global features [5].…”
Section: Introduction (mentioning)
confidence: 99%
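As a small illustration of the global features listed in this quotation, the sketch below computes the named functionals (minimum, maximum, mean, standard deviation, and slope) over a single LLD contour such as a pitch track; the synthetic contour and the slope estimate via a least-squares linear fit are illustrative choices, not specified by the cited works.

import numpy as np

def global_functionals(contour):
    # Compute the global (utterance-level) functionals named above for
    # one LLD contour, e.g. a pitch or energy track.  The slope is the
    # least-squares linear trend of the contour.
    t = np.arange(len(contour))
    slope = np.polyfit(t, contour, 1)[0]     # linear-trend coefficient
    return {
        "min": contour.min(),
        "max": contour.max(),
        "mean": contour.mean(),
        "std": contour.std(),
        "slope": slope,
    }

# Toy example: a rising pitch-like contour with noise.
pitch = 120 + 0.2 * np.arange(300) + np.random.randn(300)
print(global_functionals(pitch))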