2021
DOI: 10.1609/aaai.v35i4.16376
Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding

Abstract: Visual context provides grounding information for multimodal machine translation (MMT). However, previous MMT models and probing studies on visual features suggest that visual information is less explored in MMT as it is often redundant to textual information. In this paper, we propose an Object-level Visual Context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. With detected objects, the proposed OVC encourages MMT to ground translation on de…
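The truncated abstract describes masking detected objects that are irrelevant to the source sentence. As a rough illustration only, the Python sketch below shows one way such masking could look: object features whose similarity to the source text falls below a threshold are zeroed before being fused with the text encoder. All names, the cosine-similarity relevance score, and the threshold are assumptions for illustration; this is not the paper's OVC implementation.

```python
# Minimal, hypothetical sketch of object-level masking for MMT (not the authors' code).
# Assumptions: detected-object features and source-text encoder states share a common
# embedding dimension, and relevance is scored with cosine similarity against a fixed
# threshold; the real OVC framework may score and mask objects differently.
import torch
import torch.nn.functional as F


def mask_irrelevant_objects(text_feats: torch.Tensor,
                            obj_feats: torch.Tensor,
                            threshold: float = 0.1) -> torch.Tensor:
    """Zero out detected-object features that are weakly related to the source text.

    text_feats: (src_len, d) source-sentence encoder states
    obj_feats:  (num_obj, d) features of detected objects
    """
    # Cosine similarity between every object and every source token: (num_obj, src_len).
    sim = F.cosine_similarity(obj_feats.unsqueeze(1), text_feats.unsqueeze(0), dim=-1)
    # An object's relevance is its best match over all source tokens.
    relevance = sim.max(dim=1).values                      # (num_obj,)
    keep = (relevance >= threshold).float().unsqueeze(-1)  # (num_obj, 1)
    return obj_feats * keep  # irrelevant objects are masked to zero


# Example: the masked object features would then be fused with the text encoder
# states (e.g. via cross-attention) before decoding the target sentence.
masked = mask_irrelevant_objects(torch.randn(12, 512), torch.randn(5, 512))
```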

Cited by 21 publications (4 citation statements)
References 17 publications
“…Then, we investigate the impact of the enhanced vision features on MMT. Previous studies have already attempted to leverage object-detection features (Wang and Xiong, 2021), but the observation here is slightly different. Beyond the object-detection pretrained features, we also take the image captioning task into account.…”
Section: Impact Of Learning Objectives
confidence: 88%
“…propose a cross-lingual visual pre-training method that is fine-tuned for MMT. It is worth noting that some previous works (Ive et al., 2019; Lin et al., 2020; Wang and Xiong, 2021; Nishihara et al., 2020) adopt regional visual information as we do, which shows effectiveness compared with global visual features. The major difference between our method and theirs is that ours is retrieval-based, which breaks the reliance on bilingual sentence-image pairs. Therefore, our method is still applicable when the input is text only (without paired images), which is unfortunately not possible with those previous methods.…”
Section: Related Work
confidence: 95%
“…); Nishihara et al. (2020) and Wang and Xiong (2021) propose auxiliary losses to allow the model to make better use of visual information. Caglayan et al. (2019) and Wu et al. (2021) conduct systematic analyses to probe the contribution of the visual modality. Caglayan et al. (2020) and Ive et al. (2021) focus on improving simultaneous machine translation with visual context.…”
confidence: 99%
“…An interesting focal point is the possibility of using text data from different languages. Multilingual cross-modal architectures have enjoyed some interest [150], with attempts to utilize unpaired samples [17]. Given the relatively limited nature of hand-made datasets and the problems with scaling supervised approaches on such data, there is significant potential in moving from classic paired supervised training examples to more loosely coupled data points.…”
Section: Directions For Future Research
confidence: 99%