2021
DOI: 10.48550/arxiv.2108.12009
Preprint

EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa

Abstract: We present EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa, a simple yet expressive scheme for solving the ERC (emotion recognition in conversation) task. By simply prepending speaker names to utterances and inserting separation tokens between the utterances in a dialogue, EmoBERTa can learn intra- and inter-speaker states and context to predict the emotion of a current speaker, in an end-to-end manner. Our experiments show that we reach a new state of the art on the two popular ERC data…
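The input-formatting scheme the abstract describes is simple enough to sketch in a few lines. Below is a minimal, illustrative example assuming the Hugging Face transformers library and a roberta-base tokenizer; the speaker names and dialogue are invented, and the exact separator convention may differ from the authors' released code.

```python
# A minimal sketch of the EmoBERTa-style input formatting: prepend each
# speaker's name to their utterance and join turns with a separator token.
# Dialogue content and the "Name: text" convention are illustrative
# assumptions, not taken from the paper's released code.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

dialogue = [
    ("Monica", "There's nothing to tell!"),
    ("Joey", "C'mon, you're going out with the guy!"),
]

# Prepend each speaker's name so the model can track who is speaking.
utterances = [f"{speaker}: {text}" for speaker, text in dialogue]

# Join utterances with RoBERTa's separator token so turn boundaries are
# marked while attention can still flow across the whole dialogue.
input_text = f" {tokenizer.sep_token} ".join(utterances)

encoded = tokenizer(input_text, return_tensors="pt")
print(tokenizer.decode(encoded["input_ids"][0]))
```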

Cited by 21 publications (30 citation statements) · References 22 publications (29 reference statements)
“…Authors in [7] and [28] use graph neural networks to encode inter-utterance and inter-speaker relationships. Kim et al. [11] model contextual information by simply prepending speaker names to utterances and inserting separation tokens between the utterances in a dialogue. To generate contextualized utterance representations, Wang et al. [33] use LSTM-based encoders to capture self- and inter-speaker dependencies of interlocutors.…”
Section: Text-based Methods
confidence: 99%
“…Text: In order to provide deeper inter-utterance context, the text modality data (i.e., x_t) are passed through the Text Feature Extractor module. Here, we employ a modified RoBERTa model (φ_M-RoBERTa) proposed by Kim et al. [11] as the feature extractor. Every utterance's x_t is accompanied by the text of its preceding and next utterances, separated by the separator token <S>.…”
Section: Utterance Level Feature Extraction
confidence: 99%
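The neighbouring-utterance construction this citing work describes can be sketched as follows. The function name and example utterances are hypothetical, and the <S> token is rendered here as RoBERTa's </s> separator, which is an assumption about the intended token.

```python
# A minimal sketch of the context-window construction described above:
# each utterance is concatenated with its preceding and next utterance,
# delimited by a separator token. Names and data are illustrative.
def build_context_window(utterances, i, sep="</s>"):
    """Return utterance i flanked by its neighbours, separator-delimited."""
    parts = []
    if i > 0:
        parts.append(utterances[i - 1])   # preceding utterance, if any
    parts.append(utterances[i])           # current utterance
    if i < len(utterances) - 1:
        parts.append(utterances[i + 1])   # next utterance, if any
    return f" {sep} ".join(parts)

texts = ["How are you?", "I'm fine, thanks.", "Glad to hear it."]
print(build_context_window(texts, 1))
# How are you? </s> I'm fine, thanks. </s> Glad to hear it.
```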
“…The motivation behind choosing this architecture is that it is one of the few simple and straightforward speaker-independent multimodal architectures for emotion recognition, which makes interpreting its decisions more convenient. The current state-of-the-art methods [28], [29] for emotion recognition (in conversation) on IEMOCAP make use of speaker-specific components to enhance performance, which is outside the scope of our work. Contextual Hierarchical Fusion [19] extends the idea of contextual information to three hierarchical levels but provides only a marginal improvement over BC-LSTM, thereby making BC-LSTM the appropriate choice for our work.…”
Section: A. Model
confidence: 99%
“…They explain that successfully incorporating expressive speech into HCI involves two aspects: (a) prosodic emotion recognition and (b) expression of emotional prosody. Considerable effort has been made towards recognizing and predicting the emotional nuances in human dialogues (Kim and Vossen, 2021; Poria et al., 2019b; Zhu et al., 2021; Li et al., 2017; Poria et al., 2021; Vinyals and Le, 2015). However, current TTS systems are yet to improve on rendering emotive or expressive speech for real-world HCI systems.…”
Section: Introduction
confidence: 99%