“…(transformers) on current sentence-level language tasks is still under debate (Yun et al., 2021; Iki and Aizawa, 2021; Tan and Bansal, 2020). While some approaches report slight improvements (Sileo, 2021), the prevailing view is that visually grounded transformer models such as VL-BERT (Su et al., 2019) not only bring no improvement on language tasks but may even distort the linguistic knowledge acquired from textual corpora for natural language understanding (Tan and Bansal, 2020; Yun et al., 2021). The backbone of all transformers is a stack of attention layers (Vaswani et al., 2017), briefly explained in Section 6.…”
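As a point of reference for the attention backbone mentioned above, the sketch below shows scaled dot-product attention, the core operation of each stacked layer in Vaswani et al. (2017): softmax(QK^T / sqrt(d_k)) V. This is a minimal single-head illustration, not the formulation from Section 6 of this paper; all names and the toy dimensions are illustrative, and a full transformer additionally uses multi-head projections, residual connections, and feed-forward sublayers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise token similarities
    # Row-wise softmax (shift by the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value rows

# Toy self-attention: 4 tokens, model dimension 8, Q = K = V = x
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Stacking several such layers lets each token's representation repeatedly aggregate information from all other tokens, which is what the cited transformer models build on.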