2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01073

Unsupervised Multi-Modal Neural Machine Translation

Abstract: Unsupervised neural machine translation (UNMT) has recently achieved remarkable results [19] with only large monolingual corpora in each language. However, the uncertainty of associating target with source sentences makes UNMT theoretically an ill-posed problem. This work investigates the possibility of utilizing images for disambiguation to improve the performance of UNMT. Our assumption is intuitively based on the invariant property of images, i.e., the description of the same visual content by different lang…
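The abstract's core idea can be made concrete with a small sketch: image features, being invariant to the language of the caption, are projected into the text representation space and offered to the decoder as additional context for disambiguation. The code below is an illustrative assumption in PyTorch, not the paper's actual architecture; the class and parameter names (ImageGroundedTranslator, img_dim, the concatenation-based fusion) are hypothetical.

import torch
import torch.nn as nn


class ImageGroundedTranslator(nn.Module):
    """Hypothetical image-grounded encoder-decoder, for illustration only."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Project pre-extracted image region features (e.g., CNN outputs) into
        # the same space as the text states so both can be attended jointly.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, img_feats, tgt_tokens):
        text_states = self.encoder(self.embed(src_tokens))   # (B, S, d_model)
        img_states = self.img_proj(img_feats)                 # (B, R, d_model)
        # Concatenate text and image states into one memory bank; the decoder
        # can then use the visual content to disambiguate the source sentence.
        memory = torch.cat([text_states, img_states], dim=1)
        tgt = self.embed(tgt_tokens)
        causal = torch.triu(
            torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
        )
        dec = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(dec)                                  # (B, T, vocab)


if __name__ == "__main__":
    model = ImageGroundedTranslator(vocab_size=1000)
    src = torch.randint(0, 1000, (2, 12))   # source-language token ids
    img = torch.randn(2, 36, 2048)          # 36 region features per image
    tgt = torch.randint(0, 1000, (2, 10))   # shifted target token ids
    print(model(src, img, tgt).shape)       # torch.Size([2, 10, 1000])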

Cited by 50 publications (36 citation statements)
References 23 publications
“…Elliott and Kádár (2017) and Helcl et al (2018) investigate visually grounded representations to improve supervised multimodal machine translation, and ignore input images at test time. Using reinforcement learning, Chen et al (2018) jointly optimize a captioner and a neural machine translator to achieve unsupervised multimodal machine translation, while Su et al (2019) and Huang et al (2020) explore transformers (Vaswani et al, 2017) to construct a text encoder-decoder for the same goal. Our work differs from the cited multimodal machine translation works in that it starts from multilingual image captioning and is then applied to machine translation, whereas some of the other methods start from multimodal machine translation and are applied to machine translation; building models that take advantage of both tasks is a possible avenue for future work.…”
Section: Related Work
confidence: 99%
“…Many of the previous methods rely on pre-training on external data for either captioning or machine translation and fine-tune models using Task 1 data from Multi30k, while we rely only on the provided Task 2 data from Multi30k. For example, Su et al (2019) and Huang et al (2020) both utilize WMT News Crawl datasets to pre-train machine translation models.…”
Section: Related Work
confidence: 99%
“…For example, Calixto, Rios, and Aziz (2019) set a latent variable as a stochastic embedding that is used in the target-language decoder and to predict visual features. Chen, Jin, and Fu (2019) present a progressive learning approach for image-pivoted zero-resource machine translation, and Su et al (2019) investigate the possibility of utilizing images for disambiguation to improve the performance of unsupervised machine translation.…”
Section: Related Work
confidence: 99%
“…Using visual content for unsupervised MT (Su et al, 2019) is a promising solution for pivoting and alignment, given its availability and feasibility. Abundant multimodal content in various languages is available online (e.g.…”
Section: Introduction
confidence: 99%