Abstract: The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics, with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? …
“…It uses four modality attention heads: language-to-vision attention, vision-to-language attention, language-to-language attention, and vision-to-vision attention, allowing it to examine interactions both within and between modalities. MULTIVIZ (Liang et al., 2022) is another method for analyzing multimodal models, interpreting unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction. gScoreCAM (Chen et al., 2022) studies the CLIP model (Radford et al., 2021) specifically, in order to understand large multimodal models.…”
Section: Gradient-based and Visualization-based Methods
Recently, transformers have become enormously popular in computer vision and vision-language tasks. This rise can be attributed primarily to the capabilities offered by attention mechanisms and to the ability of transformers to adapt to a wide variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a broad range of applications. However, in the constantly changing landscape of machine learning, ensuring the trustworthiness of transformers is of utmost importance. This paper conducts a thorough examination of vision-language transformers through three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. Its primary objective is to examine the intricacies and complexities of using transformers in practice, with the overarching goal of advancing our understanding of how to make them more reliable and accountable.
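To make the four-way attention decomposition from the snippet above concrete, here is a minimal sketch (not any of the cited implementations) of how a single attention map over a concatenated language-plus-vision token sequence can be sliced into language-to-language, language-to-vision, vision-to-language, and vision-to-vision blocks. The token counts, random projections, and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: 4 language tokens followed by 6 vision tokens, model dim 8.
n_lang, n_vis, d = 4, 6, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_lang + n_vis, d))   # concatenated [language; vision] sequence
Wq = rng.normal(size=(d, d))                    # stand-in query projection
Wk = rng.normal(size=(d, d))                    # stand-in key projection

attn = softmax((tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d))  # (10, 10) attention map

# Slice the full map into the four modality blocks described in the snippet above.
blocks = {
    "lang->lang": attn[:n_lang, :n_lang],
    "lang->vis":  attn[:n_lang, n_lang:],
    "vis->lang":  attn[n_lang:, :n_lang],
    "vis->vis":   attn[n_lang:, n_lang:],
}
for name, block in blocks.items():
    # Mean attention mass per block hints at whether tokens attend mostly
    # within their own modality or across to the other one.
    print(f"{name}: mean attention mass = {block.mean():.3f}")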
“…While the experiments performed involve only the visual and text modalities, due to the high computational cost of the method, user evaluations show that DIME can help researchers determine which unimodal or multimodal contributions are the dominant factors behind the model's prediction. A more scalable extension of DIME is introduced in MULTIVIZ [183], a tool for analyzing the behavior of multimodal models that scaffolds the problem of interpretability into unimodal importance, cross-modal interactions, multimodal representations, and multimodal predictions.…”
Section: A Postmodel XAI Applied on Discrete Sets of Unimodal Inputs
Affective computing has an unprecedented potential to change the way humans interact with technology. While the last decades have witnessed vast progress in the field, multimodal affective computing systems are generally black box by design. As affective systems start to be deployed in real-world scenarios, such as education or healthcare, a shift of focus toward improved transparency and interpretability is needed. In this context, how do we explain the output of affective computing models, and how do we do so without limiting predictive performance? In this article, we review affective computing work from an explainable AI (XAI) perspective, collecting and synthesizing relevant papers into three major XAI approaches: premodel (applied before training), in-model (applied during training), and postmodel (applied after training). We present and discuss the most fundamental challenges in the field, namely: how to relate explanations back to multimodal and time-dependent data; how to integrate context and inductive biases into explanations using mechanisms such as attention, generative modeling, or graph-based methods; and how to capture intramodal and cross-modal interactions in post hoc explanations. While explainable affective computing is still nascent, existing methods are promising, contributing not only toward improved transparency but, in many cases, also surpassing state-of-the-art results. Based on these findings, we explore directions for future research and discuss the importance of data-driven XAI, of defining explanation goals and explainee needs, and of causability, or the extent to which a given method leads to human understanding.
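As a concrete illustration of separating unimodal from cross-modal contributions on a discrete set of paired inputs, the sketch below projects a black-box multimodal scorer onto its best additive (unimodal-only) approximation by marginalizing out each modality, in the spirit of additive projections such as EMAP; the residual is treated as cross-modal interaction. This is a hedged stand-in rather than DIME's or MULTIVIZ's actual procedure, and the scorer f, the feature matrices T and V, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical black-box multimodal scorer f(text_features, image_features) -> score;
# stands in for any model whose prediction we want to attribute.
def f(t, v):
    return float(np.tanh(t @ v) + 0.3 * t.sum() - 0.2 * v.sum())

# Discrete evaluation set of paired text/image feature vectors (illustrative).
T = rng.normal(size=(16, 5))
V = rng.normal(size=(16, 5))

# Score every text against every image so each modality can be marginalized out.
grid = np.array([[f(t, v) for v in V] for t in T])   # shape (16, 16)

text_only = grid.mean(axis=1)    # average over images -> unimodal text effect
image_only = grid.mean(axis=0)   # average over texts  -> unimodal image effect
additive = text_only[:, None] + image_only[None, :] - grid.mean()

# The residual between the model and its best additive approximation on the
# actual (text_i, image_i) pairs is the cross-modal interaction a post hoc
# explanation should surface.
full = np.diag(grid)
interaction = full - np.diag(additive)
print("mean |cross-modal interaction|:", np.abs(interaction).mean())
```

A large residual on a given pair suggests the prediction depends on how the two modalities combine, which is exactly the contribution a purely unimodal attribution would miss.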