Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2016
DOI: 10.18653/v1/p16-1227
Multimodal Pivots for Image Caption Translation

Abstract: We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned image…
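As a reading aid, here is a minimal, hedged sketch of the retrieve-and-rerank idea described in the abstract: retrieve target-language captions of visually similar images, then use them to rescore an MT n-best list. The feature vectors, the lexical-overlap scorer, the interpolation weight, and the helper names (retrieve_pivot_captions, rerank) are assumptions made for illustration only, not the paper's actual retrieval model or reranking objective.

```python
# Illustrative sketch of caption translation reranking via multimodal pivots.
# Assumptions (not from the paper): cosine similarity over generic image
# feature vectors, bag-of-words overlap as caption similarity, and a simple
# linear interpolation of MT score and pivot similarity.
import numpy as np
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve_pivot_captions(query_img_vec, database, k=5):
    """Return target-language captions of the k visually most similar images."""
    ranked = sorted(database,
                    key=lambda e: cosine(query_img_vec, e["img_vec"]),
                    reverse=True)
    return [e["caption"] for e in ranked[:k]]


def caption_overlap(hypothesis, pivot_captions):
    """Crude lexical similarity between an MT hypothesis and retrieved captions."""
    hyp = Counter(hypothesis.lower().split())
    scores = []
    for cap in pivot_captions:
        ref = Counter(cap.lower().split())
        overlap = sum((hyp & ref).values())
        scores.append(overlap / max(len(hypothesis.split()), 1))
    return max(scores) if scores else 0.0


def rerank(nbest, query_img_vec, database, weight=0.5, k=5):
    """Rerank an MT n-best list [(hypothesis, mt_score), ...] with multimodal pivots."""
    pivots = retrieve_pivot_captions(query_img_vec, database, k)
    rescored = [(hyp, (1 - weight) * mt_score + weight * caption_overlap(hyp, pivots))
                for hyp, mt_score in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)


# Toy usage: random vectors stand in for features from a real visual model.
rng = np.random.default_rng(0)
database = [{"img_vec": rng.normal(size=8), "caption": "a dog runs on the beach"},
            {"img_vec": rng.normal(size=8), "caption": "a man rides a bicycle"}]
nbest = [("a dog runs on the sand", 0.6), ("a dog operates on the beach", 0.7)]
print(rerank(nbest, database[0]["img_vec"], database, weight=0.5))
```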

Cited by 86 publications (79 citation statements); References: 27 publications.
“…The aim of this task is to use images in addition to source languages as inputs to improve the translation performance, hopefully relaxing ambiguity in alignment that cannot be solved by texts only. The feasibility of this approach has been demonstrated by some methods, such as visual-based reranking of SMT results (Hitschler and Riezler 2016). However, this task assumes that images are available as a part of a query in the testing phase, and thus the objective and setup are entirely different from ours.…”
Section: Computer Vision For Machine Translationmentioning
confidence: 99%
“…Such resources currently exist with annotations in German [Elliott et al, 2016, Hitschler et al, 2016, Rajendran et al, 2016], Turkish [Unal et al, 2016], Chinese [Li et al, 2016], Japanese [Miyazaki and Shimizu, 2016, Yoshikawa et al, 2017], Dutch [van Miltenburg et al, 2017], and French. Table 1 presents an overview of multilingual image description datasets.…”
Section: Multilingual Multimodal Resourcesmentioning
confidence: 99%
“…These datasets are constructed in English and are aimed at advancing research on the generation of image descriptions in English. Recent attempts have been made to incorporate multilinguality into both these large-scale datasets, with the datasets being extended to other languages such as German and Japanese (Hitschler et al, 2016; Miyazaki and Shimizu, 2016; Yoshikawa et al, 2017).…”
Section: Related Workmentioning
confidence: 99%