2021
DOI: 10.48550/arxiv.2108.12009
Preprint

EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa

Abstract: We present EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa, a simple yet expressive scheme for solving the ERC (emotion recognition in conversation) task. By simply prepending speaker names to utterances and inserting separation tokens between the utterances in a dialogue, EmoBERTa can learn intra- and inter-speaker states and context to predict the emotion of a current speaker, in an end-to-end manner. Our experiments show that we reach a new state of the art on the two popular ERC data…
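The input-formatting scheme the abstract describes is simple enough to sketch in a few lines. Below is a minimal, illustrative example assuming the Hugging Face transformers library and a roberta-base tokenizer; the speaker names and dialogue are invented, and the exact separator convention may differ from the authors' released code.

```python
# A minimal sketch of the EmoBERTa-style input formatting: prepend each
# speaker's name to their utterance and join turns with a separator token.
# Dialogue content and the "Name: text" convention are illustrative
# assumptions, not taken from the paper's released code.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

dialogue = [
    ("Monica", "There's nothing to tell!"),
    ("Joey", "C'mon, you're going out with the guy!"),
]

# Prepend each speaker's name so the model can track who is speaking.
utterances = [f"{speaker}: {text}" for speaker, text in dialogue]

# Join utterances with RoBERTa's separator token so turn boundaries are
# marked while attention can still flow across the whole dialogue.
input_text = f" {tokenizer.sep_token} ".join(utterances)

encoded = tokenizer(input_text, return_tensors="pt")
print(tokenizer.decode(encoded["input_ids"][0]))
```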

Cited by 21 publications (30 citation statements) · References 22 publications (29 reference statements)
“…Authors in [7] and [28] use graph neural networks to encode inter-utterance and inter-speaker relationships. Kim et al. [11] model contextual information by simply prepending speaker names to utterances and inserting separation tokens between the utterances in a dialogue. To generate contextualized utterance representations, Wang et al. [33] use LSTM-based encoders to capture self- and inter-speaker dependencies of interlocutors.…”
Section: Text-based Methods
confidence: 99%
“…Text: In order to provide deeper inter-utterance context, the text modality data (i.e., x_t) are passed through the Text Feature Extractor module. Here, we employ a modified RoBERTa model (φ_M-RoBERTa) proposed by Kim et al. [11] as the feature extractor. Every utterance's x_t is accompanied by the text of its preceding and next utterances, separated by the separator token <S>.…”
Section: Utterance Level Feature Extraction
confidence: 99%
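The neighbouring-utterance construction this citing work describes can be sketched as follows. The function name and example utterances are hypothetical, and the <S> token is rendered here as RoBERTa's </s> separator, which is an assumption about the intended token.

```python
# A minimal sketch of the context-window construction described above:
# each utterance is concatenated with its preceding and next utterance,
# delimited by a separator token. Names and data are illustrative.
def build_context_window(utterances, i, sep="</s>"):
    """Return utterance i flanked by its neighbours, separator-delimited."""
    parts = []
    if i > 0:
        parts.append(utterances[i - 1])   # preceding utterance, if any
    parts.append(utterances[i])           # current utterance
    if i < len(utterances) - 1:
        parts.append(utterances[i + 1])   # next utterance, if any
    return f" {sep} ".join(parts)

texts = ["How are you?", "I'm fine, thanks.", "Glad to hear it."]
print(build_context_window(texts, 1))
# How are you? </s> I'm fine, thanks. </s> Glad to hear it.
```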
“…The motivation behind choosing this architecture is that it is one of the few simple and straightforward speaker-independent multimodal architectures for emotion recognition, which makes interpreting its decisions more convenient. The current state-of-the-art methods [28], [29] for emotion recognition (in conversation) on IEMOCAP make use of speaker-specific components to enhance performance, which is outside the scope of our work. Contextual Hierarchical Fusion [19] extends the idea of contextual information to three hierarchical levels but provides only a marginal improvement over BC-LSTM, thereby making BC-LSTM the appropriate choice for our work.…”
Section: A. Model
confidence: 99%
“…They explain that successfully incorporating expressive speech into HCI involves two aspects: (a) prosodic emotion recognition and (b) expression of emotional prosody. Considerable effort has been made towards recognizing and predicting the emotional nuances in human dialogues (Kim and Vossen, 2021; Poria et al., 2019b; Zhu et al., 2021; Li et al., 2017; Poria et al., 2021; Vinyals and Le, 2015). However, current TTS systems are yet to improve on rendering emotive or expressive speech for real-world HCI systems.…”
Section: Introduction
confidence: 99%