Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.328
Using Visual Feature Space as a Pivot Across Languages

Abstract: Our work aims to leverage visual feature space to pass information across languages. We show that models trained to generate textual captions in more than one language conditioned on an input image can leverage their jointly trained feature space during inference to pivot across languages. We particularly demonstrate improved quality on a caption generated from an input image, by leveraging a caption in a second language. More importantly, we demonstrate that even without conditioning on any visual input, the …

Cited by 4 publications (8 citation statements). References 25 publications (26 reference statements).
“…The variable γ is initialized as γ = f₁(ϵ), where ϵ is a dummy input image whose pixel values are sampled from a uniform distribution. This extends our earlier reported method (Yang et al., 2020) and demonstrates the validity of this type of approach in the trilingual scenario, where one set of language pairs is not represented in the training annotations. We use a held-out set to determine the optimal number of updates in the optimization process in Eq.…”
Section: Methods (supporting)
Confidence: 85%
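As an illustration of the initialization γ = f₁(ϵ) described in the statement above, here is a minimal PyTorch-style sketch, assuming f₁ is an image encoder that maps an RGB image to a feature vector; the encoder architecture, image resolution, and variable names are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

# Stand-in for the image encoder f_1; the actual encoder used in the cited
# work is not specified in the excerpt above.
class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):
        h = self.pool(torch.relu(self.conv(x))).flatten(1)
        return self.fc(h)

f1 = ImageEncoder()

# Dummy input image epsilon: every pixel value sampled from a uniform distribution.
epsilon = torch.rand(1, 3, 224, 224)  # U(0, 1), one RGB image

# gamma is initialized from the encoder output and made trainable, so it can
# be refined by backpropagation during inference.
gamma = f1(epsilon).detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([gamma], lr=1e-2)
```

Per the statement above, γ would then be updated for a fixed number of optimization steps, with the number of updates chosen on a held-out set.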
“…For instance, multimodal machine translation is most effective when images are provided on top of parallel text, where the images enhance traditional machine translation corpora; a second limitation is that translation models are still required for every language pair even if there is a single common visual representation. The present work significantly extends our prior work on backpropagation-based decoding (Yang et al., 2020), which used LSTMs for language pairs. Instead, we adapt transformer-based decoders for language triplets and beyond.…”
Section: Introduction (mentioning)
Confidence: 61%