2022
DOI: 10.1145/3473140
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Abstract: Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structu…

Cited by 7 publications (4 citation statements) · References 44 publications
“…Deep learning (DL) [1], as one of the most popular machine learning methods driven by big data, has been widely studied and employed in various fields and different scenarios such as face detection [2], social networks [3,4], natural language processing [5,6], speech technology [7][8][9], detection of network anomalies [10,11], and multimodal learning [12][13][14].…”
Section: Introduction
Confidence: 99%
“…Recent strides in vision-language pre-training have exerted a profound impact on image captioning research [28][29][30]. Zhou et al [28] present a unified vision-language pre-training (VLP) model for image captioning, employing a Transformer network for both encoding and decoding, with pre-training on large image-text pairs.…”
Section: Vision-Language Pre-training Advancements
Confidence: 99%
“…This novel approach, leveraging textual augmentation, demonstrates improved performance in various vision-language tasks, notably in image captioning, by refining representation quality and model convergence. Li et al [30] introduce Uni-EDEN, a Universal Encoder-Decoder Network for vision-language tasks, focusing on multi-granular vision-language pre-training. This approach notably enhances multimodal reasoning and language modeling capabilities, advancing both perception and generation aspects in image captioning.…”
Section: Vision-Language Pre-training Advancements
Confidence: 99%
“…Owing to successful applications of pre-training methods in NLP [7,43] and CV [5,22], more and more researchers attempt to explore this "Pre-training & Fine-tuning" paradigm in the video-text field [25,33], which has achieved remarkable performance gain in various downstream video understanding tasks, such as video-text retrieval [10,38,53], video question answering [44,55,59], and video reasoning [6,15,42,54,57]. There are two mainstream paradigms in current video-text pre-training methods: the feature-level paradigm and the pixel-level one.…”
Section: Introduction
Confidence: 99%