Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1422

Probing the Need for Visual Context in Multimodal Machine Translation

Abstract: Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contributio…

Cited by 107 publications (101 citation statements)
References 29 publications
“…They perform careful experiments by using input degradation and observe that, especially under limited textual context, multimodal models exploit the visual input to generate better translations. Caglayan et al. (2019) also show that MMT systems exploit visual cues and obtain correct translations even with typographical errors in the source sentences. In this paper, we build upon this idea and investigate the potential of visual cues for refining translation.…”
Section: Related Work
confidence: 74%
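The input-degradation protocol mentioned in these statements can be pictured with a short sketch. This is a minimal illustration assuming a simple progressive-masking scheme (keep the first k source tokens, replace the rest with a placeholder); the function name and mask token are illustrative assumptions, not the exact setup of Caglayan et al. (2019).

```python
# Minimal sketch of source-side input degradation for probing an MMT model.
# Assumption: a progressive-masking scheme where only the first k source
# tokens are kept and the rest become a mask token. Names are illustrative.

MASK = "[v]"

def progressively_mask(tokens, k, mask=MASK):
    """Keep the first k tokens; replace the remainder with `mask`."""
    return tokens[:k] + [mask] * max(0, len(tokens) - k)

source = "a man is riding a brown horse on the beach".split()
for k in (0, 2, 4, len(source)):
    print(k, " ".join(progressively_mask(source, k)))
```

Comparing a multimodal model against a text-only baseline as k shrinks isolates how much the image compensates for the missing source context.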
“…We note that translation refinement is different from translation re-ranking from a text-only model based on image representation (Shah et al., 2016; Hitschler et al., 2016), since the latter assumes that the correct translation can already be produced by a text-only model. Caglayan et al. (2019) investigate the importance and the contribution of multimodality for MMT. They perform careful experiments by using input degradation and observe that, especially under limited textual context, multimodal models exploit the visual input to generate better translations.…”
Section: Related Work
confidence: 99%
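To make the distinction concrete: re-ranking only reorders hypotheses the text-only model has already produced, so the visual signal cannot introduce a translation the model never generated. A minimal sketch, assuming each candidate comes with a text-only model score and a separate image-text similarity score; the interpolation weight and all names are illustrative assumptions:

```python
# Minimal sketch of image-based re-ranking of text-only hypotheses.
# Assumption: each candidate has a text-only model score (e.g. a log-prob)
# and an image-text similarity score; pick the best under a linear mix.

def rerank(candidates, text_scores, image_scores, alpha=0.7):
    """Return the candidate maximizing alpha*text + (1-alpha)*image."""
    combined = [alpha * t + (1 - alpha) * v
                for t, v in zip(text_scores, image_scores)]
    return candidates[max(range(len(candidates)), key=combined.__getitem__)]

hyps = ["a man rides a horse", "a man rides a house"]
print(rerank(hyps, text_scores=[-1.2, -1.1], image_scores=[0.9, 0.1]))
```

Refinement, by contrast, lets the visual signal edit or regenerate the translation itself rather than merely choose among fixed candidates.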
“…This observation indicates that the model is using the visual modality as a regularization technique, and not using the semantics of the image. Similar findings in multimodal machine translation [11] led to an investigation of the utility of visual context [12]. They concluded that the visual modality is helpful when the primary modality is degraded.…”
Section: Introduction
confidence: 77%
“…1. Silence Masking: We substitute the masked word with a specific value, silence, similar to the special token in [12]. In this unrealistic scenario, the model is trained to generate the missing word when a known signal is present in the audio.…”
Section: Audio Corruption
confidence: 99%
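The silence-masking idea transfers the textual masking above to the audio domain: the waveform samples aligned with the masked word are overwritten with silence, so the model sees a known acoustic signal where the word used to be. A minimal sketch assuming raw waveforms as NumPy arrays; the sample rate and alignment span are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of silence masking: zero out the waveform samples
# aligned with the masked word. Sample rate and span are assumptions.

def silence_mask(wave, start_s, end_s, sample_rate=16000):
    """Return a copy of `wave` with [start_s, end_s) replaced by silence."""
    out = wave.copy()
    out[int(start_s * sample_rate):int(end_s * sample_rate)] = 0.0
    return out

audio = np.random.randn(3 * 16000)        # three seconds of dummy audio
masked = silence_mask(audio, 1.0, 1.4)    # silence the masked word's span
```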