Most MNMT models have incorporated an input image's features through a visual attention mechanism. Some studies have introduced a visual attention mechanism that captures relationships between source-language words and image regions (Delbrouck and Dupont, 2017; Zhang et al., 2020), while others have used a visual attention mechanism that captures relationships between target-language words and image regions (Calixto et al., 2017; Ive et al., 2019; Takushima et al., 2019). Note that these visual attention mechanisms were trained in an unsupervised manner, and, as far as we know, a supervised visual attention mechanism has not yet been proposed.
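To make the mechanism concrete, the following is a minimal sketch of such a word-to-region visual attention step, in the spirit of (but not reproducing) the cited models: each word state attends over a set of image-region features via scaled dot-product scores, producing a visual context vector per word. The shapes and the function name `visual_attention` are illustrative assumptions, not from any cited paper.

```python
import numpy as np

def visual_attention(word_states, region_feats):
    """Attend each word state over image regions.

    word_states:  (T, d) array of word representations.
    region_feats: (R, d) array of image-region features.
    Returns (T, d) visual contexts and (T, R) attention weights.
    """
    d = word_states.shape[-1]
    # Relevance of every image region to every word (scaled dot product).
    scores = word_states @ region_feats.T / np.sqrt(d)          # (T, R)
    # Softmax over regions: each word's weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of region features gives a per-word visual context.
    context = weights @ region_feats                            # (T, d)
    return context, weights

rng = np.random.default_rng(0)
ctx, w = visual_attention(rng.normal(size=(5, 16)),   # 5 words
                          rng.normal(size=(49, 16)))  # 7x7 region grid
print(ctx.shape, w.shape)
```

In the unsupervised setting the paragraph describes, these attention weights receive no direct supervision; they are shaped only by the translation loss, which is exactly the gap a supervised visual attention mechanism would address.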