Interspeech 2020
DOI: 10.21437/interspeech.2020-2312
Multimodal Speech Emotion Recognition Using Cross Attention with Aligned Audio and Text

Cited by 12 publications (6 citation statements)
References 7 publications
“…To align with previous studies [20], we use 7,487 utterances covering seven emotions: frustration, neutral, anger, sadness, excitement, happiness, and surprise. Since there is no standard split for this dataset, we follow [20, 14] and perform 10-fold cross-validation, where an 8:1:1 split is used for training, validation, and testing, respectively. The weighted accuracy (WA, i.e., the overall accuracy) and the unweighted accuracy (UA, i.e., the average accuracy over all emotion categories) are adopted as the evaluation metrics.…”
Section: Datasets
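The two metrics quoted above are simple to compute from model predictions: WA is plain overall accuracy, while UA is the macro-average of per-class recalls. A minimal NumPy sketch (function and variable names are ours, not from the cited work):

```python
import numpy as np

def weighted_and_unweighted_accuracy(y_true, y_pred, num_classes):
    """WA: overall accuracy over all utterances.
    UA: average of per-class recalls (macro recall)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))   # weighted accuracy
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():                      # skip classes absent from y_true
            recalls.append(np.mean(y_pred[mask] == c))
    ua = float(np.mean(recalls))            # unweighted accuracy
    return wa, ua

# Toy example with 3 emotion classes:
print(weighted_and_unweighted_accuracy([0, 0, 1, 2], [0, 1, 1, 2], 3))
# -> (0.75, 0.833...)
```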
“…4. CAN [14] aggregates the sequential information from the aligned audio and text by applying the attention weights of each modality in both a normal and a crossed way.…”
Section: Baselines
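To make the quoted "normal and crossed" aggregation concrete, the sketch below pools each modality's time-aligned sequence twice: once with attention weights computed from its own features and once with weights computed from the other modality. This illustrates the general mechanism only; the class name, layer choices, and dimensions are our assumptions, not the CAN authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionPooling(nn.Module):
    """Sketch of normal + crossed attention pooling over time-aligned
    audio/text sequences. Dimensions and layers are illustrative."""

    def __init__(self, dim):
        super().__init__()
        self.score_audio = nn.Linear(dim, 1)  # scores from audio frames
        self.score_text = nn.Linear(dim, 1)   # scores from text tokens

    def forward(self, audio, text):
        # audio, text: (batch, T, dim); alignment means both share length T
        a_w = torch.softmax(self.score_audio(audio), dim=1)  # (B, T, 1)
        t_w = torch.softmax(self.score_text(text), dim=1)    # (B, T, 1)
        audio_normal = (a_w * audio).sum(dim=1)   # audio pooled by its own weights
        text_normal = (t_w * text).sum(dim=1)     # text pooled by its own weights
        audio_crossed = (t_w * audio).sum(dim=1)  # audio pooled by text weights
        text_crossed = (a_w * text).sum(dim=1)    # text pooled by audio weights
        return torch.cat([audio_normal, text_normal,
                          audio_crossed, text_crossed], dim=-1)

# Usage: fused = CrossAttentionPooling(128)(torch.randn(2, 50, 128),
#                                           torch.randn(2, 50, 128))
```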
“…Then, the connection between features and annotations is characterized by SVR, with its scaling parameter chosen from {0.01·n_F, 0.1·n_F, n_F, 10·n_F} and the same choices of regularization parameter as in the SVMs. In addition, neural networks are employed, considering 12 selections of hidden-layer neurons: (32, 8), (32, 16), (64, 16), …”
Section: A. Experimental Preparation, 1) Corpus and Features
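The quoted hyper-parameter grid maps naturally onto a standard grid search. The sketch below assumes scikit-learn's SVR, interprets the "scaling parameter" as the RBF gamma (our assumption, not stated in the quote), and uses a placeholder regularization grid, since the SVM values it mirrors are not quoted in this excerpt; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for utterance-level features and continuous annotations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))   # n_F = 88 features per utterance (illustrative)
y = rng.normal(size=200)

n_F = X.shape[1]
param_grid = {
    # Scaling parameter searched over {0.01*n_F, 0.1*n_F, n_F, 10*n_F};
    # mapping it to the RBF gamma is our assumption.
    "gamma": [0.01 * n_F, 0.1 * n_F, 1.0 * n_F, 10.0 * n_F],
    # Placeholder regularization grid; the excerpt says the SVR reuses the
    # SVM grid but does not quote the actual values.
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```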
“…The past two decades have witnessed rapid progress in the paralinguistics of auditory affective computing [1]-[3], encompassing emotion recognition in speech [4]-[6], music [7], and multimodal conditions [8]. Typically, in speech emotion recognition (SER) research, machines learn to perceive emotional information in speech through learning procedures configured under certain settings [5].…”