Abstract: The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics, with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? …
“…It uses four modality attention heads: language-to-vision attention, vision-to-language attention, language-to-language attention, and vision-to-vision attention, allowing it to examine interactions both within and between modalities. MULTIVIZ (Liang et al., 2022) is another method for analyzing multimodal models, interpreting unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction. gScoreCAM (Chen et al., 2022) studies the CLIP model (Radford et al., 2021) specifically, in order to understand large multimodal models.…”
Section: Gradient-based and Visualization-based Methods
Recently, transformers have become enormously popular in computer vision and vision-language tasks. This rise can be attributed primarily to the capabilities offered by attention mechanisms and to the ability of transformers to adapt to a wide variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a broad range of applications. However, in the constantly changing landscape of machine learning, ensuring the trustworthiness of transformers is of utmost importance. This paper conducts a thorough examination of vision-language transformers through three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. Its primary objective is to examine the intricacies and complexities of using transformers in practice, with the overarching goal of advancing our understanding of how to make them more reliable and accountable.
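To make the four-way attention decomposition from the snippet above concrete, here is a minimal sketch (not any of the cited implementations) of how a single attention map over a concatenated language-plus-vision token sequence can be sliced into language-to-language, language-to-vision, vision-to-language, and vision-to-vision blocks. The token counts, random projections, and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: 4 language tokens followed by 6 vision tokens, model dim 8.
n_lang, n_vis, d = 4, 6, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_lang + n_vis, d))   # concatenated [language; vision] sequence
Wq = rng.normal(size=(d, d))                    # stand-in query projection
Wk = rng.normal(size=(d, d))                    # stand-in key projection

attn = softmax((tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d))  # (10, 10) attention map

# Slice the full map into the four modality blocks described in the snippet above.
blocks = {
    "lang->lang": attn[:n_lang, :n_lang],
    "lang->vis":  attn[:n_lang, n_lang:],
    "vis->lang":  attn[n_lang:, :n_lang],
    "vis->vis":   attn[n_lang:, n_lang:],
}
for name, block in blocks.items():
    # Mean attention mass per block hints at whether tokens attend mostly
    # within their own modality or across to the other one.
    print(f"{name}: mean attention mass = {block.mean():.3f}")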
“…While the experiments performed involve only the visual and text modalities, due to the high computational cost of the method, user evaluations show that DIME can help researchers determine which unimodal or multimodal contributions are the dominant factors behind the model's prediction. A more scalable extension of DIME is introduced in MULTIVIZ [183], a tool for analyzing the behavior of multimodal models that scaffolds the problem of interpretability into unimodal importance, cross-modal interactions, multimodal representations, and multimodal predictions.…”
Section: A Postmodel XAI Applied on Discrete Sets of Unimodal Inputs
Affective computing has an unprecedented potential to change the way humans interact with technology. While the last decades have witnessed vast progress in the field, multimodal affective computing systems are generally black box by design. As affective systems start to be deployed in real-world scenarios, such as education or healthcare, a shift of focus toward improved transparency and interpretability is needed. In this context, how do we explain the output of affective computing models, and how do we do so without limiting predictive performance? In this article, we review affective computing work from an explainable AI (XAI) perspective, collecting and synthesizing relevant papers into three major XAI approaches: premodel (applied before training), in-model (applied during training), and postmodel (applied after training). We present and discuss the most fundamental challenges in the field, namely: how to relate explanations back to multimodal and time-dependent data; how to integrate context and inductive biases into explanations using mechanisms such as attention, generative modeling, or graph-based methods; and how to capture intramodal and cross-modal interactions in post hoc explanations. While explainable affective computing is still nascent, existing methods are promising, contributing not only toward improved transparency but, in many cases, also surpassing state-of-the-art results. Based on these findings, we explore directions for future research and discuss the importance of data-driven XAI, of defining explanation goals and explainee needs, and of causability, or the extent to which a given method leads to human understanding.
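As a concrete illustration of separating unimodal from cross-modal contributions on a discrete set of paired inputs, the sketch below projects a black-box multimodal scorer onto its best additive (unimodal-only) approximation by marginalizing out each modality, in the spirit of additive projections such as EMAP; the residual is treated as cross-modal interaction. This is a hedged stand-in rather than DIME's or MULTIVIZ's actual procedure, and the scorer f, the feature matrices T and V, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical black-box multimodal scorer f(text_features, image_features) -> score;
# stands in for any model whose prediction we want to attribute.
def f(t, v):
    return float(np.tanh(t @ v) + 0.3 * t.sum() - 0.2 * v.sum())

# Discrete evaluation set of paired text/image feature vectors (illustrative).
T = rng.normal(size=(16, 5))
V = rng.normal(size=(16, 5))

# Score every text against every image so each modality can be marginalized out.
grid = np.array([[f(t, v) for v in V] for t in T])   # shape (16, 16)

text_only = grid.mean(axis=1)    # average over images -> unimodal text effect
image_only = grid.mean(axis=0)   # average over texts  -> unimodal image effect
additive = text_only[:, None] + image_only[None, :] - grid.mean()

# The residual between the model and its best additive approximation on the
# actual (text_i, image_i) pairs is the cross-modal interaction a post hoc
# explanation should surface.
full = np.diag(grid)
interaction = full - np.diag(additive)
print("mean |cross-modal interaction|:", np.abs(interaction).mean())
```

A large residual on a given pair suggests the prediction depends on how the two modalities combine, which is exactly the contribution a purely unimodal attribution would miss.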