Proceedings of the Second Conference on Machine Translation 2017
DOI: 10.18653/v1/w17-4752

Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation

Abstract: This paper describes the University of Sheffield's submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation approach. We investigate the effect of replacing the commonly-used image embeddings with an estimated posterior probability prediction for 1,000 object categories […]
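
To make the idea in the abstract concrete, the fragment below is a minimal PyTorch-style sketch, not the authors' code: the class name, layer sizes, and usage are assumptions. It shows how a softmax posterior over 1,000 object categories could stand in for a pooled image embedding when initialising an NMT decoder.

# Minimal sketch, not the authors' implementation; names and sizes are assumptions.
import torch
import torch.nn as nn

class PosteriorInit(nn.Module):
    """Map a 1,000-way object posterior to the initial hidden state of an RNN decoder."""
    def __init__(self, num_classes=1000, hidden_size=512):
        super().__init__()
        self.proj = nn.Linear(num_classes, hidden_size)

    def forward(self, object_posterior):
        # object_posterior: (batch, 1000), e.g. the softmax output of an object classifier
        return torch.tanh(self.proj(object_posterior))  # (batch, hidden_size), used as decoder h_0

# Hypothetical usage:
# v = torch.softmax(torch.randn(4, 1000), dim=-1)   # posterior predictions for 4 images
# h0 = PosteriorInit()(v)                            # fed to the decoder as its initial state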

Cited by 15 publications (5 citation statements) · References 20 publications
“…Initial approaches use RNN-based sequence-to-sequence models (Bahdanau et al., 2015) enhanced with a single, global image vector, extracted as one of the layers of a CNN trained for object classification (He et al., 2016), often the penultimate or final layer. The image representation is integrated into the MT models by initialising the encoder or decoder (Elliott et al., 2015; Caglayan et al., 2017; Madhyastha et al., 2017); element-wise multiplication with the source word annotations (Caglayan et al., 2017); or projecting the image representation and encoder context to a common space to initialise the decoder. Elliott and Kádár (2017) and Helcl et al. (2018) instead model the source sentence and reconstruct the image representation jointly via multi-task learning.…”
Section: Related Work
confidence: 99%
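
The two integration strategies quoted above can be sketched as follows. This is a hypothetical PyTorch fragment with assumed tensor shapes and projection layers, not code from any of the cited systems: (a) element-wise multiplication of a projected image vector with the source word annotations, and (b) projecting the image representation and pooled encoder context into a common space to initialise the decoder.

# Illustrative only; shapes and projections are assumptions, not taken from the cited papers.
import torch
import torch.nn as nn

def multiply_with_annotations(annotations, image_vec, img_to_enc):
    # annotations: (batch, src_len, enc_dim) encoder states; image_vec: (batch, img_dim)
    gate = torch.tanh(img_to_enc(image_vec)).unsqueeze(1)   # (batch, 1, enc_dim)
    return annotations * gate                                # broadcast over source positions

def init_decoder_from_common_space(annotations, image_vec, img_proj, ctx_proj):
    context = annotations.mean(dim=1)                        # pooled encoder context
    return torch.tanh(img_proj(image_vec) + ctx_proj(context))  # shared-space decoder init

# Hypothetical usage:
# ann, img = torch.randn(2, 7, 512), torch.randn(2, 2048)
# fused = multiply_with_annotations(ann, img, nn.Linear(2048, 512))
# h0 = init_decoder_from_common_space(ann, img, nn.Linear(2048, 512), nn.Linear(512, 512))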
“…Our proposed model, even though it is textual, produced results competitive with other multimodal models. The mixture-of-experts model outperformed several multimodal models, including another WMT submission [29]–[32]. Even on the out-of-domain COCO 2017 dataset, the mixture-of-experts model performed reasonably well, with a 28.0 BLEU score.…”
Section: Model Specification and Implementation Details
confidence: 89%
“…Later initialisation variants are applied to attentive NMTs: Calixto et al. (2016) and Libovický et al. (2016) experiment with recurrent decoder initialisation while Ma et al. (2017) initialise both the encoder and the decoder, with features from a state-of-the-art ResNet (He et al. 2016). Madhyastha et al. (2017) explore the expressiveness of the posterior probability vector as a visual representation, rather than the pooled features from the penultimate layer of a CNN.…”
Section: Sequence-to-sequence Grounding With Pooled Features
confidence: 99%
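
The contrast drawn in this last excerpt, pooled penultimate-layer features versus a posterior probability vector, can be illustrated with a stock torchvision ResNet-50. This is a sketch under that assumption; the cited systems' exact feature-extraction pipelines may differ.

# Sketch only: a stock torchvision ResNet-50, not the cited systems' extraction code.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet50()                                  # pretrained weights omitted here
penultimate = nn.Sequential(*list(resnet.children())[:-1])  # everything up to global avg-pool

x = torch.randn(1, 3, 224, 224)                             # a dummy image batch
with torch.no_grad():
    pooled = penultimate(x).flatten(1)                      # (1, 2048) pooled feature vector
    posterior = torch.softmax(resnet(x), dim=-1)            # (1, 1000) object class posterior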