2020
DOI: 10.1609/aaai.v34i05.6484
Visual Agreement Regularized Training for Multi-Modal Machine Translation

Abstract: Multi-modal machine translation aims at translating the source sentence into a different language in the presence of the paired image. Previous work suggests that additional visual information only provides dispensable help to translation, which is needed in several very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation m…
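As a rough illustration of the idea in the abstract, the snippet below shows one way a visual agreement regularizer could be implemented: a symmetric KL penalty between the visual attention of a source-to-target model and a target-to-source model at steps that generate aligned visual words. This is a minimal sketch under assumed interfaces (forward_attn, backward_attn, lambda_agree are illustrative names), not the authors' implementation, which may use a different distance or alignment scheme.

import torch
import torch.nn.functional as F

def visual_agreement_loss(forward_attn: torch.Tensor,
                          backward_attn: torch.Tensor) -> torch.Tensor:
    # forward_attn, backward_attn: (batch, num_regions) attention distributions
    # over the same image regions, taken at time steps that produce aligned
    # visual words (e.g. "ball" in English and "ballon" in French).
    kl_fb = F.kl_div(backward_attn.clamp_min(1e-8).log(), forward_attn,
                     reduction="batchmean")
    kl_bf = F.kl_div(forward_attn.clamp_min(1e-8).log(), backward_attn,
                     reduction="batchmean")
    return 0.5 * (kl_fb + kl_bf)

# Hypothetical joint objective: both translation losses plus the agreement term.
# loss = nll_src2tgt + nll_tgt2src + lambda_agree * visual_agreement_loss(a_fwd, a_bwd)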

Cited by 25 publications (10 citation statements)
References 18 publications (22 reference statements)
“…However, visual information may only be needed in particular cases. In [164][165][166][167][168], some methods are committed to solving this problem. Huang et al [169] try to explore the possibility of unsupervised learning with shared visual features in different languages.…”
Section: Multimodal Machine Translation
confidence: 99%
“…To avoid noise of irrelevant information in the image, Delbrouck and Dupont [12] propose a gating mechanism to weight the importance of the textual and image contexts. Recently, more works [22,27,39,42] propose to represent the image with a set of object features via object detection, considering the strong correspondences between image objects and noun entities in the source sentence. Yang et al [39] propose to jointly train source-to-target and target-to-source translation models to encourage the model to share the same focus on object regions by visual agreement regularization.…”
Section: Related Work
confidence: 99%
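The gating mechanism mentioned in the statement above can be pictured roughly as follows: a learned scalar gate decides how much of the image context is mixed into the textual context before decoding. This is an illustrative sketch with assumed names (GatedMultimodalFusion, text_ctx, image_ctx), not the cited paper's exact formulation.

import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, out_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.image_proj = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, 1)

    def forward(self, text_ctx: torch.Tensor, image_ctx: torch.Tensor) -> torch.Tensor:
        # g in (0, 1): how strongly the visual context contributes at this step.
        g = torch.sigmoid(self.gate(torch.cat([text_ctx, image_ctx], dim=-1)))
        return self.text_proj(text_ctx) + g * self.image_proj(image_ctx)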
“…Recently, more works [22,27,39,42] propose to represent the image with a set of object features via object detection, considering the strong correspondences between image objects and noun entities in the source sentence. Yang et al [39] propose to jointly train source-to-target and target-to-source translation models to encourage the model to share the same focus on object regions by visual agreement regularization. Yin et al [42] propose to construct a multi-modal graph with image objects and source words according to an external visual grounding model for the alignment of multi-modal nodes.…”
Section: Related Work
confidence: 99%
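Both statements above mention representing the image as a set of object features obtained via object detection. The following is a minimal sketch assuming an off-the-shelf torchvision Faster R-CNN; the cited works may use different detectors and feature-pooling choices, so treat the function name and parameters as illustrative.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def image_to_object_regions(image: torch.Tensor, top_k: int = 10):
    # image: (3, H, W) float tensor scaled to [0, 1].
    # Returns up to top_k detected boxes and labels; a downstream MMT model
    # would pool region features from these boxes and attend over them.
    with torch.no_grad():
        det = detector([image])[0]  # dict with 'boxes', 'scores', 'labels'
    keep = det["scores"].argsort(descending=True)[:top_k]
    return det["boxes"][keep], det["labels"][keep]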
“…Hence, MMT exhibits pronounced reliance on language-vision/speech interaction. However, effectively integrating visual information and language-vision interaction into machine translation has been regarded as a big challenge (Yang et al 2020) for years since Multi30K (Elliott et al 2016) is proposed as a benchmark dataset for MMT. Many previous MMT studies on Multi30K, which exploit complete source texts during both training and inference, have found that visual context is needed only in special cases, e.g., translating sentences with incorrect or ambiguous source words, by both human and machine translation, and is hence marginally beneficial to multimodal machine translation (Lala et al 2018; Ive, Madhyastha, and Specia 2019).…”
Section: Introduction
confidence: 99%