Most MNMT models have incorporated an input image's features through a visual attention mechanism. Some studies have introduced a visual attention mechanism that captures relationships between source-language words and image regions (Delbrouck and Dupont, 2017; Zhang et al., 2020), while others have used a visual attention mechanism that captures relationships between target-language words and image regions (Calixto et al., 2017; Ive et al., 2019; Takushima et al., 2019). Note that these visual attention mechanisms were trained in an unsupervised manner, and, as far as we know, a supervised visual attention mechanism has not yet been proposed.
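To make the mechanism concrete, the following is a minimal sketch of such a word-to-region visual attention step, in the spirit of (but not reproducing) the cited models: each word state attends over a set of image-region features via scaled dot-product scores, producing a visual context vector per word. The shapes and the function name `visual_attention` are illustrative assumptions, not from any cited paper.

```python
import numpy as np

def visual_attention(word_states, region_feats):
    """Attend each word state over image regions.

    word_states:  (T, d) array of word representations.
    region_feats: (R, d) array of image-region features.
    Returns (T, d) visual contexts and (T, R) attention weights.
    """
    d = word_states.shape[-1]
    # Relevance of every image region to every word (scaled dot product).
    scores = word_states @ region_feats.T / np.sqrt(d)          # (T, R)
    # Softmax over regions: each word's weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of region features gives a per-word visual context.
    context = weights @ region_feats                            # (T, d)
    return context, weights

rng = np.random.default_rng(0)
ctx, w = visual_attention(rng.normal(size=(5, 16)),   # 5 words
                          rng.normal(size=(49, 16)))  # 7x7 region grid
print(ctx.shape, w.shape)
```

In the unsupervised setting the paragraph describes, these attention weights receive no direct supervision; they are shaped only by the translation loss, which is exactly the gap a supervised visual attention mechanism would address.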