Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1422

Probing the Need for Visual Context in Multimodal Machine Translation

Abstract: Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contributio…

Cited by 107 publications (101 citation statements)
References 29 publications
“…They perform careful experiments by using input degradation and observe that, especially under limited textual context, multimodal models exploit the visual input to generate better translations. Caglayan et al. (2019) also show that MMT systems exploit visual cues and obtain correct translations even with typographical errors in the source sentences. In this paper, we build upon this idea and investigate the potential of visual cues for refining translation.…”
Section: Related Work
confidence: 74%
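The input-degradation protocol mentioned in these statements can be pictured with a short sketch. This is a minimal illustration assuming a simple progressive-masking scheme (keep the first k source tokens, replace the rest with a placeholder); the function name and mask token are illustrative assumptions, not the exact setup of Caglayan et al. (2019).

```python
# Minimal sketch of source-side input degradation for probing an MMT model.
# Assumption: a progressive-masking scheme where only the first k source
# tokens are kept and the rest become a mask token. Names are illustrative.

MASK = "[v]"

def progressively_mask(tokens, k, mask=MASK):
    """Keep the first k tokens; replace the remainder with `mask`."""
    return tokens[:k] + [mask] * max(0, len(tokens) - k)

source = "a man is riding a brown horse on the beach".split()
for k in (0, 2, 4, len(source)):
    print(k, " ".join(progressively_mask(source, k)))
```

Comparing a multimodal model against a text-only baseline as k shrinks isolates how much the image compensates for the missing source context.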
“…We note that translation refinement is different from translation re-ranking from a text-only model based on image representation (Shah et al., 2016; Hitschler et al., 2016), since the latter assumes that the correct translation can already be produced by a text-only model. Caglayan et al. (2019) investigate the importance and the contribution of multimodality for MMT. They perform careful experiments by using input degradation and observe that, especially under limited textual context, multimodal models exploit the visual input to generate better translations.…”
Section: Related Work
confidence: 99%
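To make the distinction concrete: re-ranking only reorders hypotheses the text-only model has already produced, so the visual signal cannot introduce a translation the model never generated. A minimal sketch, assuming each candidate comes with a text-only model score and a separate image-text similarity score; the interpolation weight and all names are illustrative assumptions:

```python
# Minimal sketch of image-based re-ranking of text-only hypotheses.
# Assumption: each candidate has a text-only model score (e.g. a log-prob)
# and an image-text similarity score; pick the best under a linear mix.

def rerank(candidates, text_scores, image_scores, alpha=0.7):
    """Return the candidate maximizing alpha*text + (1-alpha)*image."""
    combined = [alpha * t + (1 - alpha) * v
                for t, v in zip(text_scores, image_scores)]
    return candidates[max(range(len(candidates)), key=combined.__getitem__)]

hyps = ["a man rides a horse", "a man rides a house"]
print(rerank(hyps, text_scores=[-1.2, -1.1], image_scores=[0.9, 0.1]))
```

Refinement, by contrast, lets the visual signal edit or regenerate the translation itself rather than merely choose among fixed candidates.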
“…This observation indicates that the model is using the visual modality as a regularization technique, and not using the semantics of the image. Similar findings in multimodal machine translation [11] led to an investigation of the utility of visual context [12]. They concluded that the visual modality is helpful when the primary modality is degraded.…”
Section: Introduction
confidence: 77%
“…1. Silence Masking: We substitute the masked word with a specific value, silence, similar to the special token in [12]. In this unrealistic scenario, the model is trained to generate the missing word when a known signal is present in the audio.…”
Section: Audio Corruption
confidence: 99%
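The silence-masking idea transfers the textual masking above to the audio domain: the waveform samples aligned with the masked word are overwritten with silence, so the model sees a known acoustic signal where the word used to be. A minimal sketch assuming raw waveforms as NumPy arrays; the sample rate and alignment span are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of silence masking: zero out the waveform samples
# aligned with the masked word. Sample rate and span are assumptions.

def silence_mask(wave, start_s, end_s, sample_rate=16000):
    """Return a copy of `wave` with [start_s, end_s) replaced by silence."""
    out = wave.copy()
    out[int(start_s * sample_rate):int(end_s * sample_rate)] = 0.0
    return out

audio = np.random.randn(3 * 16000)        # three seconds of dummy audio
masked = silence_mask(audio, 1.0, 1.4)    # silence the masked word's span
```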