Proceedings of the Second Conference on Machine Translation 2017
DOI: 10.18653/v1/w17-4746

LIUM-CVC Submissions for WMT17 Multimodal Translation Task

Abstract: This paper describes the monomodal and multimodal Neural Machine Translation systems developed by LIUM and CVC for the WMT17 Shared Task on Multimodal Translation. We mainly explored two multimodal architectures where either global visual features or convolutional feature maps are integrated in order to benefit from visual context. Our final systems ranked first for both En→De and En→Fr language pairs according to the automatic evaluation metrics METEOR and BLEU.
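The abstract distinguishes two ways of injecting visual context: global visual features and spatial convolutional feature maps. As a hedged illustration of the second idea only (not the authors' exact architecture; the framework, layer names, and sizes below are assumptions), a decoder can attend over flattened CNN feature maps to build a visual context vector at each step:

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Additive attention over spatial CNN feature maps; the decoder state
    selects which image regions to attend to. All sizes are illustrative."""
    def __init__(self, feat_dim=2048, hid_dim=512, att_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)  # project region features
        self.hid_proj = nn.Linear(hid_dim, att_dim)    # project decoder hidden state
        self.score = nn.Linear(att_dim, 1)             # scalar relevance per region

    def forward(self, feats, hidden):
        # feats:  (batch, n_regions, feat_dim) feature map, flattened spatially
        # hidden: (batch, hid_dim) current decoder hidden state
        energy = torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_regions)
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)      # (batch, feat_dim)
        return context, alpha

# usage sketch: 196 regions from a 14x14 feature map
# ctx, attn = VisualAttention()(torch.randn(4, 196, 2048), torch.randn(4, 512))
```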

Cited by 79 publications (88 citation statements)
References 21 publications
“…The former linearly projects the concatenation of textual and visual context vectors to obtain the multimodal context vector, while the latter replaces the concatenation with another attention layer. Finally, we also experiment with encoder-decoder initialization (INIT) (Calixto and Liu, 2017; Caglayan et al., 2017a), where we initialize both the encoder and the decoder using a non-linear transformation of the pool5 features.…”
Section: Input Degradation (mentioning)
confidence: 99%
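The two mechanisms quoted above can be sketched roughly as follows; this is a minimal illustration assuming PyTorch and hypothetical dimensions, not the cited systems' actual code. `ConcatFusion` linearly projects the concatenated textual and visual context vectors into one multimodal context, and `InitFromPool5` builds initial encoder/decoder states from a non-linear transformation of pool5 features (the INIT scheme):

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Multimodal context = linear projection of [textual ctx; visual ctx].
    Dimensions are assumed for illustration."""
    def __init__(self, txt_dim=512, img_dim=2048, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(txt_dim + img_dim, out_dim)

    def forward(self, c_txt, c_img):
        # c_txt: (batch, txt_dim), c_img: (batch, img_dim)
        return self.proj(torch.cat([c_txt, c_img], dim=-1))

class InitFromPool5(nn.Module):
    """INIT-style scheme: a non-linear transform of pool5 features gives the
    initial recurrent states of the encoder and decoder (hypothetical sizes)."""
    def __init__(self, pool5_dim=2048, state_dim=512):
        super().__init__()
        self.to_state = nn.Sequential(nn.Linear(pool5_dim, state_dim), nn.Tanh())

    def forward(self, pool5):
        # pool5: (batch, pool5_dim) global image feature
        s0 = self.to_state(pool5)
        return s0, s0.clone()  # one copy for the encoder, one for the decoder
```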
“…Initializing the encoder and the decoder is an approach previously explored in multimodal machine translation [7,8]. In order to ground the speech encoder with visual context, we first introduce two non-linear layers to learn an initial hidden and cell state globally for all LSTM layers E_k in the encoder:…”
Section: Tied Initialization For Encoder And Decoder (mentioning)
confidence: 99%
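A hedged sketch of this tied-initialization idea (assuming PyTorch; the feature and state sizes and the number of encoder layers are placeholders, not the cited system's values): one pair of non-linear layers maps a global visual feature to an initial hidden and cell state shared across all encoder LSTM layers.

```python
import torch
import torch.nn as nn

class TiedLSTMInit(nn.Module):
    """Learn a single (h0, c0) pair from a global visual vector and tile it
    across all encoder LSTM layers, as in the tied-initialization idea above.
    Dimensions and the feature source are assumptions for illustration."""
    def __init__(self, vis_dim=2048, hid_dim=320, num_layers=6):
        super().__init__()
        self.num_layers = num_layers
        self.init_h = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.Tanh())
        self.init_c = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.Tanh())

    def forward(self, vis_feat):
        # vis_feat: (batch, vis_dim) global visual feature for the utterance/image
        h0 = self.init_h(vis_feat).unsqueeze(0).repeat(self.num_layers, 1, 1)
        c0 = self.init_c(vis_feat).unsqueeze(0).repeat(self.num_layers, 1, 1)
        return h0, c0  # shapes: (num_layers, batch, hid_dim), ready for nn.LSTM
```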
“…In this paper, we first apply an adaptive training scheme [3,4,5] for sequence-to-sequence (S2S) speech recognition and then propose two novel multimodal grounding methods for S2S ASR inspired by previous work in image captioning [6] and multimodal neural machine translation (MMT) [7,8]. We compare both approaches through the use of visual features extracted from pre-trained models trained for object, scene and action recognition tasks [9,10,11].…”
Section: Introduction (mentioning)
confidence: 99%
“…In the unconstrained variant of the imagination experiments, the dataset consists of examples that can miss either the textual target values (MSCOCO extension), or the image (additional parallel data).…”
[Table 3 caption: Results on the 2016 test set in terms of BLEU and METEOR scores. We compare our results with last year's best system (Caglayan et al., 2017), which used model ensembling instead of weight averaging.]
Section: Methods (mentioning)
confidence: 99%
“…We find that with self-attentive networks, we are able to improve over a strong textual baseline by including the visual information in the model. This has proven challenging in the previous RNN-based submissions, where there was only a minor difference in performance between textual and multimodal models (Caglayan et al., 2017). This paper is organized as follows.…”
Section: Introduction (mentioning)
confidence: 99%