2022
DOI: 10.48550/arxiv.2201.04026
Preprint
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Cited by 1 publication (1 citation statement)
References: 0 publications
“…They constructed a decision fusion module to combine the outputs of Transformer modules at different granularities. To pretrain both the encoder for multi-modal representation extraction and the language decoder for sentence generation, reference [29] proposed a pretrained universal encoder-decoder network (Uni-EDEN) to support vision-language perception and generation. The model is pretrained with multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG).…”
Section: Related Work
confidence: 99%
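The citation statement describes multi-granular pre-training with four proxy tasks optimized together. A minimal sketch of how such per-task losses could be combined into one training objective is shown below; the task names follow the statement (MOC, MRPG, ISM, MSG), but the weighting scheme and function names are illustrative assumptions, not the paper's actual implementation.

```python
def pretraining_loss(task_losses, weights=None):
    """Combine per-task losses (e.g. MOC, MRPG, ISM, MSG) into a single
    scalar objective via a weighted sum. Equal weights by default."""
    if weights is None:
        weights = {task: 1.0 for task in task_losses}
    return sum(weights[task] * value for task, value in task_losses.items())

# Hypothetical loss values from one batch of the four proxy tasks:
total = pretraining_loss({"MOC": 0.8, "MRPG": 1.2, "ISM": 0.5, "MSG": 0.9})
print(total)  # 3.4 with equal weights
```

In practice such a combined objective would be backpropagated through the shared encoder-decoder so all proxy tasks shape the same representations.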