Multi30K: Multilingual English-German Image Descriptions

Elliott, Desmond; Frank, Stella; Sima’an, Khalil; Specia, Lucia

doi:10.18653/v1/w16-3210

Cited by 320 publications

(281 citation statements)

References 12 publications

Supporting

Mentioning: 279

Contrasting

“…We extend our experiments to the Multi30K data set, which is built on the Flickr30K data set (Young et al, 2014) and consists of English, German (Elliott et al, 2016), and French (Elliott et al, 2017) captions. For Multi30K, there are 29,000 images in the training set, 1,014 in the development set and 1,000 in the test set.…”

Section: Extension To Multiple Languages (mentioning)

confidence: 99%

Visually Grounded Neural Syntax Acquisition

Shi, Mao, Gimpel et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define the concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VG-NSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we also apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.

“…With this goal, the dataset also serves in WAT 2019 shared task on multi-modal translation. We illustrated that the text-only information in the surrounding words could be sufficient for the disambiguation. One interesting research direction would be thus to ignore all the surrounding words and simply ask: given the image, what is the correct Hindi translation of this ambiguous English word.…”

Section: Discussion (mentioning)

confidence: 90%

“…for resolving ambiguity due to different senses of words in different contexts. One of the starting points is "Flickr30k" [9], a multilingual (English-German, English-French, and English-Czech) shared task based on multimodal translation was part of WMT 2018 [10]. [11] proposed a multimodal NMT system using image feature for Hindi-English language pair.…”

Section: Related Work (mentioning)

confidence: 99%

Hindi Visual Genome: A Dataset for Multi-Modal English to Hindi Machine Translation

Parida, Bojar, Dash 2019

CyS

Visual Genome is a dataset connecting structured image information with English language. We present "Hindi Visual Genome", a multimodal dataset consisting of text and images suitable for English-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing which took the associated images into account. We prepared a set of 31525 segments, accompanied by a challenge test set of 1400 segments. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Our dataset is the first for multimodal English-Hindi machine translation, freely available for noncommercial research purposes. Our Hindi version of Visual Genome also allows to create Hindi image labelers or other practical tools. Hindi Visual Genome also serves in Workshop on Asian Translation (WAT) 2019 Multi-Modal Translation Task.

“…As test data, set of 1,000 tuples containing an English description and its corresponding image was provided. More details about the shared task data can be found in (Elliott et al, 2016).…”

Section: Data (mentioning)

confidence: 99%

SHEF-Multimodal: Grounding Machine Translation on Images

Shah, Wang, Specia 2016

Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Self Cite

This paper describes the University of Sheffield's submission for the WMT16 Multimodal Machine Translation shared task, where we participated in Task 1 to develop German-to-English and English-to-German statistical machine translation (SMT) systems in the domain of image descriptions. Our proposed systems are standard phrase-based SMT systems based on the Moses decoder, trained only on the provided data. We investigate how image features can be used to re-rank the n-best list produced by the SMT model, with the aim of improving performance by grounding the translations on images. Our submissions are able to outperform the strong, text-only baseline system for both directions.
