2021
DOI: 10.1609/aaai.v35i4.16376
Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding

Abstract: Visual context provides grounding information for multimodal machine translation (MMT). However, previous MMT models and probing studies on visual features suggest that visual information is less explored in MMT as it is often redundant to textual information. In this paper, we propose an Object-level Visual Context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. With detected objects, the proposed OVC encourages MMT to ground translation on de…
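The truncated abstract describes masking detected objects that are irrelevant to the source sentence. As a rough illustration only, the Python sketch below shows one way such masking could look: object features whose similarity to the source text falls below a threshold are zeroed before being fused with the text encoder. All names, the cosine-similarity relevance score, and the threshold are assumptions for illustration; this is not the paper's OVC implementation.

```python
# Minimal, hypothetical sketch of object-level masking for MMT (not the authors' code).
# Assumptions: detected-object features and source-text encoder states share a common
# embedding dimension, and relevance is scored with cosine similarity against a fixed
# threshold; the real OVC framework may score and mask objects differently.
import torch
import torch.nn.functional as F


def mask_irrelevant_objects(text_feats: torch.Tensor,
                            obj_feats: torch.Tensor,
                            threshold: float = 0.1) -> torch.Tensor:
    """Zero out detected-object features that are weakly related to the source text.

    text_feats: (src_len, d) source-sentence encoder states
    obj_feats:  (num_obj, d) features of detected objects
    """
    # Cosine similarity between every object and every source token: (num_obj, src_len).
    sim = F.cosine_similarity(obj_feats.unsqueeze(1), text_feats.unsqueeze(0), dim=-1)
    # An object's relevance is its best match over all source tokens.
    relevance = sim.max(dim=1).values                      # (num_obj,)
    keep = (relevance >= threshold).float().unsqueeze(-1)  # (num_obj, 1)
    return obj_feats * keep  # irrelevant objects are masked to zero


# Example: the masked object features would then be fused with the text encoder
# states (e.g. via cross-attention) before decoding the target sentence.
masked = mask_irrelevant_objects(torch.randn(12, 512), torch.randn(5, 512))
```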

Cited by 21 publications (4 citation statements)
References 17 publications
“…Then, we investigate the impact of the enhanced vision features on MMT. Previous studies have already attempted to leverage object-detection features (Wang and Xiong, 2021), but the observation here is slightly different. Beyond the object-detection pretrained features, we also take the image captioning task into account.…”
Section: Impact Of Learning Objectives
confidence: 88%
“…propose a cross-lingual visual pre-training method that is fine-tuned for MMT. It is worth noting that some previous works (Ive et al., 2019; Lin et al., 2020; Wang and Xiong, 2021; Nishihara et al., 2020) adopt regional visual information as we do, which shows effectiveness compared with global visual features. The major difference between our method and theirs is that ours is retrieval-based, which breaks the reliance on bilingual sentence-image pairs. Therefore, our method is still applicable when the input is text only (without paired images), which is unfortunately not possible with those previous methods.…”
Section: Related Work
confidence: 95%
“…); Nishihara et al. (2020) and Wang and Xiong (2021) propose auxiliary losses to allow the model to make better use of visual information. Caglayan et al. (2019) and Wu et al. (2021) conduct systematic analyses to probe the contribution of the visual modality. Caglayan et al. (2020) and Ive et al. (2021) focus on improving simultaneous machine translation with visual context.…”
confidence: 99%
“…An interesting focal point is the possibility of using text data from different languages. Multilingual cross-modal architectures have enjoyed some interest [150], with attempts to utilize unpaired samples [17]. Given the relatively limited nature of hand-made datasets and the problems with scaling supervised approaches on such data, there is significant potential in moving from classic paired supervised training examples to more loosely coupled data points.…”
Section: Directions For Future Research
confidence: 99%