Proceedings of the Second Conference on Machine Translation 2017
DOI: 10.18653/v1/w17-4746

LIUM-CVC Submissions for WMT17 Multimodal Translation Task

Abstract: This paper describes the monomodal and multimodal Neural Machine Translation systems developed by LIUM and CVC for the WMT17 Shared Task on Multimodal Translation. We mainly explored two multimodal architectures where either global visual features or convolutional feature maps are integrated in order to benefit from visual context. Our final systems ranked first for both En→De and En→Fr language pairs according to the automatic evaluation metrics METEOR and BLEU.
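The abstract distinguishes two ways of injecting visual context: global visual features and spatial convolutional feature maps. As a hedged illustration of the second idea only (not the authors' exact architecture; the framework, layer names, and sizes below are assumptions), a decoder can attend over flattened CNN feature maps to build a visual context vector at each step:

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Additive attention over spatial CNN feature maps; the decoder state
    selects which image regions to attend to. All sizes are illustrative."""
    def __init__(self, feat_dim=2048, hid_dim=512, att_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)  # project region features
        self.hid_proj = nn.Linear(hid_dim, att_dim)    # project decoder hidden state
        self.score = nn.Linear(att_dim, 1)             # scalar relevance per region

    def forward(self, feats, hidden):
        # feats:  (batch, n_regions, feat_dim) feature map, flattened spatially
        # hidden: (batch, hid_dim) current decoder hidden state
        energy = torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_regions)
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)      # (batch, feat_dim)
        return context, alpha

# usage sketch: 196 regions from a 14x14 feature map
# ctx, attn = VisualAttention()(torch.randn(4, 196, 2048), torch.randn(4, 512))
```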

Cited by 79 publications (88 citation statements)
References 21 publications
“…The former linearly projects the concatenation of textual and visual context vectors to obtain the multimodal context vector, while the latter replaces the concatenation with another attention layer. Finally, we also experiment with encoder-decoder initialization (INIT) (Calixto and Liu, 2017; Caglayan et al., 2017a), where we initialize both the encoder and the decoder using a non-linear transformation of the pool5 features.…”
Section: Input Degradation (mentioning)
confidence: 99%
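The two mechanisms quoted above can be sketched roughly as follows; this is a minimal illustration assuming PyTorch and hypothetical dimensions, not the cited systems' actual code. `ConcatFusion` linearly projects the concatenated textual and visual context vectors into one multimodal context, and `InitFromPool5` builds initial encoder/decoder states from a non-linear transformation of pool5 features (the INIT scheme):

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Multimodal context = linear projection of [textual ctx; visual ctx].
    Dimensions are assumed for illustration."""
    def __init__(self, txt_dim=512, img_dim=2048, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(txt_dim + img_dim, out_dim)

    def forward(self, c_txt, c_img):
        # c_txt: (batch, txt_dim), c_img: (batch, img_dim)
        return self.proj(torch.cat([c_txt, c_img], dim=-1))

class InitFromPool5(nn.Module):
    """INIT-style scheme: a non-linear transform of pool5 features gives the
    initial recurrent states of the encoder and decoder (hypothetical sizes)."""
    def __init__(self, pool5_dim=2048, state_dim=512):
        super().__init__()
        self.to_state = nn.Sequential(nn.Linear(pool5_dim, state_dim), nn.Tanh())

    def forward(self, pool5):
        # pool5: (batch, pool5_dim) global image feature
        s0 = self.to_state(pool5)
        return s0, s0.clone()  # one copy for the encoder, one for the decoder
```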
“…Initializing the encoder and the decoder is an approach previously explored in multimodal machine translation [7,8]. In order to ground the speech encoder with visual context, we first introduce two non-linear layers to learn an initial hidden and cell state globally for all LSTM layers E_k in the encoder:…”
Section: Tied Initialization For Encoder And Decoder (mentioning)
confidence: 99%
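A hedged sketch of this tied-initialization idea (assuming PyTorch; the feature and state sizes and the number of encoder layers are placeholders, not the cited system's values): one pair of non-linear layers maps a global visual feature to an initial hidden and cell state shared across all encoder LSTM layers.

```python
import torch
import torch.nn as nn

class TiedLSTMInit(nn.Module):
    """Learn a single (h0, c0) pair from a global visual vector and tile it
    across all encoder LSTM layers, as in the tied-initialization idea above.
    Dimensions and the feature source are assumptions for illustration."""
    def __init__(self, vis_dim=2048, hid_dim=320, num_layers=6):
        super().__init__()
        self.num_layers = num_layers
        self.init_h = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.Tanh())
        self.init_c = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.Tanh())

    def forward(self, vis_feat):
        # vis_feat: (batch, vis_dim) global visual feature for the utterance/image
        h0 = self.init_h(vis_feat).unsqueeze(0).repeat(self.num_layers, 1, 1)
        c0 = self.init_c(vis_feat).unsqueeze(0).repeat(self.num_layers, 1, 1)
        return h0, c0  # shapes: (num_layers, batch, hid_dim), ready for nn.LSTM
```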
“…In this paper, we first apply an adaptive training scheme [3,4,5] for sequence-to-sequence (S2S) speech recognition and then propose two novel multimodal grounding methods for S2S ASR inspired by previous work in image captioning [6] and multimodal neural machine translation (MMT) [7,8]. We compare both approaches through the use of visual features extracted from pre-trained models trained for object, scene and action recognition tasks [9,10,11].…”
Section: Introduction (mentioning)
confidence: 99%
“…In the unconstrained variant of the imagination experiments, the dataset consists of examples that can miss either the textual target values (MSCOCO extension), or the image (additional parallel data).…”
[Table 3 caption: Results on the 2016 test set in terms of BLEU and METEOR scores. We compare our results with last year's best system (Caglayan et al., 2017), which used model ensembling instead of weight averaging.]
Section: Methods (mentioning)
confidence: 99%
“…We find that with self-attentive networks, we are able to improve over a strong textual baseline by including the visual information in the model. This has proven challenging in the previous RNN-based submissions, where there was only a minor difference in performance between textual and multimodal models (Caglayan et al., 2017). This paper is organized as follows.…”
Section: Introduction (mentioning)
confidence: 99%