Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6439
The MeMAD Submission to the WMT18 Multimodal Translation Task

Abstract: This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual featu…

Cited by 53 publications (47 citation statements)
References 17 publications (24 reference statements)
“…The CUNI submissions use two architectures based on the self-attentive Transformer model (Vaswani et al., 2017). For German and Czech, a language model is used to extract pseudo-in-domain data.

Table 5: Participants in the WMT18 multimodal machine translation shared task.
ID | Participating team
AFRL-OHIOSTATE | Air Force Research Laboratory & Ohio State University (Gwinnup et al., 2018)
CUNI | Univerzita Karlova v Praze (Helcl et al., 2018)
LIUMCVC | Laboratoire d'Informatique de l'Université du Maine & Universitat Autonoma de Barcelona Computer Vision Center (Caglayan et al., 2018)
MeMAD | Aalto University, Helsinki University & EURECOM (Grönroos et al., 2018)
OSU-BAIDU | Oregon State University & Baidu Research (Zheng et al., 2018)
SHEF | University of Sheffield
UMONS | Université de Mons (Delbrouck and Dupont, 2018)…”
Section: CUNI (Task 1)
Confidence: 99%
“…Each image in this dataset is associated with up to 5 independently annotated English captions, for a total of 616,767 captions. Though originally monolingual, the dataset's large size makes it useful for data-augmentation methods in image-guided translation, as demonstrated in Grönroos et al. (2018). There has also been some effort to add other languages to COCO.…”
Section: Flickr8k
Confidence: 99%
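The augmentation idea quoted above, using a large monolingual caption set to create synthetic multimodal training pairs, can be sketched roughly as follows. This is a hedged illustration only: `make_synthetic_triples`, `translate`, and the `(image_id, caption)` input format are assumptions for the sketch, not the cited implementation.

```python
def make_synthetic_triples(captions, translate):
    """Pair each English caption with a machine translation to build
    synthetic (image, source, target) training examples."""
    return [(image_id, src, translate(src)) for image_id, src in captions]

# Usage with a dummy "translator" standing in for a text-only NMT system.
demo = make_synthetic_triples(
    [("img1", "a dog runs"), ("img2", "two cats sit")],
    lambda s: s.upper(),  # placeholder for a real MT system
)
print(demo[0])  # ('img1', 'a dog runs', 'A DOG RUNS')
```

In practice the translation side would come from an existing text-only NMT system, and the resulting triples would be mixed with genuine multimodal data during training.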
“…Recently, some Transformer-based multimodal NMT models have been proposed. Grönroos et al. [11] added a gating layer to each output of the Transformer encoder and decoder, and their model uses visual features in the gate. They showed that the gating layer in the encoder reduces ambiguity when encoding source-language sentences, and that the one in the decoder suppresses the output of unnecessary words.…”
Section: Previous Multimodal NMT
Confidence: 99%
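A visual gate of the kind described above can be sketched as a sigmoid function of each hidden state concatenated with a global image feature, multiplied element-wise into the hidden state. This is a minimal numpy sketch under assumed shapes; the function and parameter names (`visual_gate`, `W`, `b`) are illustrative, not the cited papers' actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_gate(h, v, W, b):
    """Gate each position's hidden state h_t by sigmoid(W [h_t; v] + b),
    where v is a global visual feature shared across positions."""
    seq_len = h.shape[0]
    v_tiled = np.tile(v, (seq_len, 1))                 # repeat image feature per position
    gate = sigmoid(np.concatenate([h, v_tiled], axis=1) @ W + b)  # (seq_len, d), values in (0, 1)
    return gate * h                                    # element-wise modulation of hidden states

# Toy dimensions: hidden size 8, visual feature size 4, sequence length 5.
rng = np.random.default_rng(0)
d, d_v, seq_len = 8, 4, 5
h = rng.standard_normal((seq_len, d))
v = rng.standard_normal(d_v)
W = rng.standard_normal((d + d_v, d)) * 0.1
b = np.zeros(d)
out = visual_gate(h, v, W, b)
print(out.shape)  # (5, 8)
```

Because the gate lies in (0, 1), each output component is a damped copy of the corresponding hidden state, which matches the suppression behaviour the excerpt attributes to the decoder-side gate.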