Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.328
Using Visual Feature Space as a Pivot Across Languages

Abstract: Our work aims to leverage visual feature space to pass information across languages. We show that models trained to generate textual captions in more than one language conditioned on an input image can leverage their jointly trained feature space during inference to pivot across languages. We particularly demonstrate improved quality on a caption generated from an input image, by leveraging a caption in a second language. More importantly, we demonstrate that even without conditioning on any visual input, the …

Cited by 4 publications (8 citation statements). References 25 publications (26 reference statements).
“…The variable γ is initialized as γ = f₁(ϵ), where ϵ is a dummy input image whose pixel values are sampled from a uniform distribution. This extends our earlier reported method (Yang et al., 2020) and demonstrates the validity of this type of approach in the trilingual scenario, where one set of language pairs is not represented in the training annotations. We use a held-out set to determine the optimal number of updates in the optimization process in Eq.…”
Section: Methods (supporting)
Confidence: 85%
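As an illustration of the initialization γ = f₁(ϵ) described in the statement above, here is a minimal PyTorch-style sketch, assuming f₁ is an image encoder that maps an RGB image to a feature vector; the encoder architecture, image resolution, and variable names are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

# Stand-in for the image encoder f_1; the actual encoder used in the cited
# work is not specified in the excerpt above.
class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):
        h = self.pool(torch.relu(self.conv(x))).flatten(1)
        return self.fc(h)

f1 = ImageEncoder()

# Dummy input image epsilon: every pixel value sampled from a uniform distribution.
epsilon = torch.rand(1, 3, 224, 224)  # U(0, 1), one RGB image

# gamma is initialized from the encoder output and made trainable, so it can
# be refined by backpropagation during inference.
gamma = f1(epsilon).detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([gamma], lr=1e-2)
```

Per the statement above, γ would then be updated for a fixed number of optimization steps, with the number of updates chosen on a held-out set.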
“…For instance, multimodal machine translation is most effective when images are provided on top of parallel text, where the images enhance traditional machine translation corpora; a second limitation is that translation models are still required for every language pair even if there is a single common visual representation. The present work significantly extends our prior work on backpropagation-based decoding (Yang et al., 2020), which used LSTMs for language pairs. Instead, we adapt transformer-based decoders for language triplets and beyond.…”
Section: Introduction (mentioning)
Confidence: 61%