2020
DOI: 10.1002/cpe.5954
Various syncretic co‐attention network for multimodal sentiment analysis

Abstract: The multimedia contents shared on social network reveal public sentimental attitudes toward specific events. Therefore, it is necessary to conduct sentiment analysis automatically on abundant multimedia data posted by the public for real-world applications. However, approaches to single-modal sentiment analysis neglect the internal connections between textual and visual contents, and current multimodal methods fail to exploit the multilevel semantic relations of heterogeneous features. In this article, the var…

Citations: Cited by 6 publications (5 citation statements)
References: 46 publications (109 reference statements)
“…In order to assess the effectiveness of our proposed approach, we have undertaken a comparative analysis between our investigation utilizing the BG dataset and the current literature: [11], [16]- [18], [23], [24]. The comparison results in Group 1 of Table 4 demonstrate that, regarding the F1-score (92.60%) and accuracy (92.65%), the DMLANet model outperformed AMGN.…”
Section: E. Comparative Results and Discussion (mentioning)
confidence: 99%
“…However, this model suffered from excessive memory overhead due to its lengthy execution time. Cao et al [24] proposed various syncretic co-attention networks (VSCN) to investigate multi-level matching correlations across multimodal information and consider each modality's specific characteristics for integrated sentiment classification. However, the emotion polarity is frequently unclear because visual components convey more information than text, causing the model to generate incorrect predictions occasionally.…”
Section: Literature Review a Multimodal Sentiment Analysismentioning
confidence: 99%
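The excerpt above describes the cited co-attention idea only at a high level, so a minimal illustration may help. The following is a rough PyTorch sketch of bidirectional co-attention between text-token and image-region features; the class name, projection layers, and affinity-matrix formulation are illustrative assumptions and do not reproduce the actual VSCN architecture of Cao et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Toy bidirectional co-attention between text tokens and image regions."""
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  text_dim)
        # image_feats: (batch, num_regions, image_dim)
        t = self.text_proj(text_feats)                 # (B, T, H)
        v = self.image_proj(image_feats)               # (B, R, H)
        # Affinity matrix: similarity of every token with every region.
        affinity = torch.bmm(t, v.transpose(1, 2))     # (B, T, R)
        # Text attends to image regions; image attends to text tokens.
        text_to_image = F.softmax(affinity, dim=2)     # weights over regions
        image_to_text = F.softmax(affinity, dim=1)     # weights over tokens
        attended_image = torch.bmm(text_to_image, v)                       # (B, T, H)
        attended_text = torch.bmm(image_to_text.transpose(1, 2), t)        # (B, R, H)
        return attended_text, attended_image
```

The point of the sketch is the two softmaxes over the same affinity matrix: each modality's features are re-weighted by their relevance to the other, which is what distinguishes co-attention from one-way attention.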
“…When applying attention mechanisms to images, different feature vectors associated with different regions are assigned different weights to create an attended image vector, as seen in the work of Zhang et al [ 3 ]. In contrast, Cao et al [ 4 ] adopt an asymmetric attention framework to generate attended image and textual feature vectors, while Xu et al [ 5 ] use a dual attention network (DAN) to simultaneously predict the attention distribution of both the image and the text. Unlike collaborative attention, where an asymmetric attention framework is used to generate attended feature vectors, memory vectors can be repeatedly modified at each inference level using a repeated DAN structure.…”
Section: Related Work (mentioning)
confidence: 99%
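As a concrete illustration of the "attended image vector" idea in the excerpt above (region features weighted by attention scores and then summed), here is a minimal PyTorch sketch of text-guided attention over image regions. The names, dimensions, and scoring function are assumptions made for illustration and are not taken from Zhang et al., Cao et al., or Xu et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    """Toy text-guided attention that pools image-region features."""
    def __init__(self, region_dim: int, query_dim: int, hidden_dim: int):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (batch, num_regions, region_dim)  -- one vector per image region
        # query:   (batch, query_dim)                -- e.g. a sentence embedding
        h = torch.tanh(self.region_proj(regions) + self.query_proj(query).unsqueeze(1))
        weights = F.softmax(self.score(h).squeeze(-1), dim=1)          # (B, R)
        # Weighted sum of region features -> attended image vector.
        attended = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)  # (B, region_dim)
        return attended, weights
```

The weighted sum is the attended image vector the excerpt refers to; repeating this step over several hops, with the query updated each time, is the basic idea behind memory-style structures such as DAN.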
“…As a result, many multimodal sentiment categorization approaches have been proposed to incorporate diverse modalities. These approaches are classified into three distinct categories: early/feature fusion [17,18], intermediate/joint fusion [19][20][21][22][23][24][25][26][27][28], and late/decision fusion [29][30][31]. In the early fusion approach, a unified feature vector is created first, and then a Machine Learning (ML) classifier is fed with the features extracted from the input data.…”
Section: Literature Review, 2.1 Visual-Textual Sentiment Analysis (mentioning)
confidence: 99%
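To make the distinction drawn above between early (feature) fusion and late (decision) fusion concrete, the following minimal PyTorch sketch contrasts the two; the layer sizes, class names, and the simple averaging rule in the late-fusion branch are illustrative assumptions, not the methods of the cited works.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then classify the joint vector."""
    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_feat, image_feat):
        # text_feat: (B, text_dim), image_feat: (B, image_dim)
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then average the decisions."""
    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.text_clf = nn.Linear(text_dim, num_classes)
        self.image_clf = nn.Linear(image_dim, num_classes)

    def forward(self, text_feat, image_feat):
        # Average the per-modality logits as a simple decision-level fusion rule.
        return 0.5 * (self.text_clf(text_feat) + self.image_clf(image_feat))
```

Early fusion lets the classifier model cross-modal feature interactions directly, while late fusion keeps the modalities independent until the decision stage; intermediate/joint fusion methods such as co-attention sit between the two.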
“…The way textual and visual features are extracted and incorporated allows the model to achieve robust performance. Cao et al [22] proposed Various Syncretic Co-attention Networks (VSCN) to investigate multi-level matching correlations between multimodal data and incorporate each modality's unique information for integrated sentiment classification. However, the emotion polarity could be clearer because visual components convey more information than text, causing the model to generate incorrect predictions occasionally.…”
Section: Literature Review, 2.1 Visual-Textual Sentiment Analysis (mentioning)
confidence: 99%