Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers 2016
DOI: 10.18653/v1/w16-2346
A Shared Task on Multimodal Machine Translation and Crosslingual Image Description

Abstract: This paper introduces and summarises the findings of a new shared task at the intersection of Natural Language Processing and Computer Vision: the generation of image descriptions in a target language, given an image and/or one or more descriptions in a different (source) language. This challenge was organised along with the Conference on Machine Translation (WMT16), and called for system submissions for two task variants: (i) a translation task, in which a source language image description needs to be transla…

Cited by 173 publications (106 citation statements)
References 25 publications
“…At first dominated by statistical methods combining count-based translation and language models [33], the current paradigm relies upon deep neural network models [34]. New ideas continue to be introduced, including models which take advantage of shared visual context [35], but the majority of MT research has focused on the text-to-text case. Recent work has moved beyond that paradigm by implementing translation between speech audio in the source language and written text in the target language [36,37,38].…”
Section: Relation To Prior Work
confidence: 99%
“…Multimodal machine translation (MMT) has been the subject of two large-scale shared task evaluations at the Conference on Machine Translation [Specia et al., 2016], which we refer to as MMT16 and MMT17. These shared tasks have focused on generating descriptions of images in non-English languages, either by translating parallel text or by crosslingual description using independently collected sentences.…”
Section: Evaluating Multilingual Multimodal Models
confidence: 99%
“…no re-training or fine-tuning (using the post-edited development set) was performed; only the gold-standard data was (marginally) different due to the post-edits. Table 6 shows the relative difference in system performance when evaluated using the post-edited references as compared to the original ranking [Specia et al., 2016]. The differences in performance between the two test sets are nonexistent or marginal and do not lead to any changes in the overall ranking of the systems.…”
Section: (B) English Description Inaccurate
confidence: 99%
“…processing (NLP) tasks as well, such as image captioning [26] and some task-specific translation, e.g. sign language translation [5]. However, [23] demonstrates that most multimodal translation algorithms are not significantly better than an off-the-shelf text-only machine translation (MT) model on the Multi30K dataset [11]. How translation models should take advantage of visual context remains an open question, because from the perspective of information theory the mutual information of two random variables, I(X; Y), will always be no greater than I(X; Y, Z), due to the following fact.…”
Section: Introduction
confidence: 99%
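The "fact" the last excerpt cuts off is presumably the chain rule for mutual information combined with the non-negativity of conditional mutual information; a minimal sketch in standard notation (the variable roles below are an interpretation, not stated in the excerpt):

```latex
% Chain rule for mutual information, then non-negativity of
% conditional mutual information:
\begin{aligned}
I(X; Y, Z) &= I(X; Y) + I(X; Z \mid Y) \\
           &\ge I(X; Y), \qquad \text{since } I(X; Z \mid Y) \ge 0 .
\end{aligned}
```

Reading X as the target sentence, Y as the source sentence, and Z as the image, the inequality says that conditioning on the visual context Z can never reduce the information available about the target, which is why the excerpt treats the lack of gains over text-only MT as a puzzle.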