2020
DOI: 10.1609/aaai.v34i05.6484
Visual Agreement Regularized Training for Multi-Modal Machine Translation

Abstract: Multi-modal machine translation aims at translating the source sentence into a different language in the presence of the paired image. Previous work suggests that additional visual information only provides dispensable help to translation, which is needed in several very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation m…
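As a rough illustration of the idea in the abstract, the snippet below shows one way a visual agreement regularizer could be implemented: a symmetric KL penalty between the visual attention of a source-to-target model and a target-to-source model at steps that generate aligned visual words. This is a minimal sketch under assumed interfaces (forward_attn, backward_attn, lambda_agree are illustrative names), not the authors' implementation, which may use a different distance or alignment scheme.

import torch
import torch.nn.functional as F

def visual_agreement_loss(forward_attn: torch.Tensor,
                          backward_attn: torch.Tensor) -> torch.Tensor:
    # forward_attn, backward_attn: (batch, num_regions) attention distributions
    # over the same image regions, taken at time steps that produce aligned
    # visual words (e.g. "ball" in English and "ballon" in French).
    kl_fb = F.kl_div(backward_attn.clamp_min(1e-8).log(), forward_attn,
                     reduction="batchmean")
    kl_bf = F.kl_div(forward_attn.clamp_min(1e-8).log(), backward_attn,
                     reduction="batchmean")
    return 0.5 * (kl_fb + kl_bf)

# Hypothetical joint objective: both translation losses plus the agreement term.
# loss = nll_src2tgt + nll_tgt2src + lambda_agree * visual_agreement_loss(a_fwd, a_bwd)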

Cited by 25 publications (10 citation statements)
References 18 publications (22 reference statements)
“…However, visual information may only be needed in particular cases. In [164][165][166][167][168], some methods are committed to solving this problem. Huang et al [169] try to explore the possibility of unsupervised learning with shared visual features in different languages.…”
Section: Multimodal Machine Translation
confidence: 99%
“…To avoid noise of irrelevant information in the image, Delbrouck and Dupont [12] propose a gating mechanism to weight the importance of the textual and image contexts. Recently, more works [22,27,39,42] propose to represent the image with a set of object features via object detection, considering the strong correspondences between image objects and noun entities in the source sentence. Yang et al [39] propose to jointly train source-to-target and target-to-source translation models to encourage the model to share the same focus on object regions by visual agreement regularization.…”
Section: Related Work
confidence: 99%
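The gating mechanism mentioned in the statement above can be pictured roughly as follows: a learned scalar gate decides how much of the image context is mixed into the textual context before decoding. This is an illustrative sketch with assumed names (GatedMultimodalFusion, text_ctx, image_ctx), not the cited paper's exact formulation.

import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, out_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.image_proj = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, 1)

    def forward(self, text_ctx: torch.Tensor, image_ctx: torch.Tensor) -> torch.Tensor:
        # g in (0, 1): how strongly the visual context contributes at this step.
        g = torch.sigmoid(self.gate(torch.cat([text_ctx, image_ctx], dim=-1)))
        return self.text_proj(text_ctx) + g * self.image_proj(image_ctx)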
“…Recently, more works [22,27,39,42] propose to represent the image with a set of object features via object detection, considering the strong correspondences between image objects and noun entities in the source sentence. Yang et al [39] propose to jointly train source-to-target and target-to-source translation models to encourage the model to share the same focus on object regions by visual agreement regularization. Yin et al [42] propose to construct a multi-modal graph with image objects and source words according to an external visual grounding model for the alignment of multi-modal nodes.…”
Section: Related Work
confidence: 99%
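Both statements above mention representing the image as a set of object features obtained via object detection. The following is a minimal sketch assuming an off-the-shelf torchvision Faster R-CNN; the cited works may use different detectors and feature-pooling choices, so treat the function name and parameters as illustrative.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def image_to_object_regions(image: torch.Tensor, top_k: int = 10):
    # image: (3, H, W) float tensor scaled to [0, 1].
    # Returns up to top_k detected boxes and labels; a downstream MMT model
    # would pool region features from these boxes and attend over them.
    with torch.no_grad():
        det = detector([image])[0]  # dict with 'boxes', 'scores', 'labels'
    keep = det["scores"].argsort(descending=True)[:top_k]
    return det["boxes"][keep], det["labels"][keep]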
“…Hence, MMT exhibits pronounced reliance on language-vision/speech interaction. However, effectively integrating visual information and language-vision interaction into machine translation has been regarded as a big challenge (Yang et al 2020) for years since Multi30K (Elliott et al 2016) is proposed as a benchmark dataset for MMT. Many previous MMT studies on Multi30K, which exploit complete source texts during both training and inference, have found that visual context is needed only in special cases, e.g., translating sentences with incorrect or ambiguous source words, by both human and machine translation, and is hence marginally beneficial to multimodal machine translation (Lala et al 2018; Ive, Madhyastha, and Specia 2019).…”
Section: Introduction
confidence: 99%