Various transformer-based VQA models [Su et al., 2019, Li et al., 2019a,b, Zhou et al., 2019, Chefer et al., 2021] have been introduced in the last few years. Among them, [Tan and Bansal, 2019] and [Lu et al., 2019] are two-stream transformer architectures that use cross-attention layers and co-attention layers, respectively, to allow information exchange across modalities. There are several studies on the interpretability of VQA models [Goyal et al., 2016, Kafle and Kanan, 2017, Jabri et al., 2016], and yet very few have focused on the co-attention transformer layers used in recent VQA models.
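Concretely, a co-attention layer can be sketched as scaled dot-product attention in which the queries come from one modality while the keys and values come from the other; the notation below is ours, a schematic rather than the exact formulation of any of the cited models:

\[
\mathrm{CoAttn}(X_v, X_t) \;=\; \mathrm{softmax}\!\left(\frac{(X_v W_Q)(X_t W_K)^{\top}}{\sqrt{d}}\right) X_t W_V,
\]

where \(X_v\) and \(X_t\) denote the visual and textual token representations, \(W_Q, W_K, W_V\) are learned projection matrices, and \(d\) is the key dimension. A symmetric layer lets the textual stream attend to the visual one, so the two streams exchange information at every such layer.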