2021 | DOI: 10.1609/aaai.v35i4.16431
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Abstract: We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction. …
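The masking idea behind these Scene Graph Prediction tasks can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the parsed triples, the `mask_for_sgp` helper, and the always-mask probability are all assumptions made for the example.

```python
import random

# Hypothetical scene-graph triples for the caption
# "a black cat sits on a wooden table" (the parser output is assumed).
scene_graph = {
    "objects": ["cat", "table"],
    "attributes": [("cat", "black"), ("table", "wooden")],
    "relationships": [("cat", "sits on", "table")],
}

def mask_for_sgp(tokens, graph, task, mask_token="[MASK]", p=1.0):
    """Mask the tokens belonging to one scene-graph node type; the
    model must recover them from the remaining text and the image."""
    if task == "object":
        targets = set(graph["objects"])
    elif task == "attribute":
        targets = {attr for _, attr in graph["attributes"]}
    else:  # "relationship"
        targets = {w for _, rel, _ in graph["relationships"]
                   for w in rel.split()}
    return [mask_token if t in targets and random.random() < p else t
            for t in tokens]

tokens = "a black cat sits on a wooden table".split()
print(mask_for_sgp(tokens, scene_graph, task="attribute"))
# ['a', '[MASK]', 'cat', 'sits', 'on', 'a', '[MASK]', 'table']
```

Because the masked positions are chosen from scene-graph nodes rather than uniformly at random, reconstructing them forces the model to ground objects, attributes, and relations in the image, which is the intuition the abstract describes.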

Cited by 145 publications (56 citation statements) | References 25 publications
“…For multi-modal fusion modules, existing methods can be classified into two categories (i.e., single-stream and dual-stream). Specifically, single-stream models [8,27,28,43] use a single Transformer for early, unconstrained fusion between modalities, while dual-stream models [35,47,55] adopt a co-attention mechanism to let the two modalities interact. For pretext tasks, inspired by uni-modal pre-training schemes such as MLM [10,33] and causal language modeling [6], existing studies explore a variety of pre-training tasks, including MLM [27,35,47], MIM [8,35], ITM [27,58], image-text contrastive learning [26], and prefix language modeling [51].…”
Section: Related Work
confidence: 99%
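To make the single-stream versus dual-stream distinction concrete, here is a minimal PyTorch sketch. The hidden size, head counts, and sequence lengths are assumptions for illustration and are not taken from any of the cited models.

```python
import torch
import torch.nn as nn

d = 256  # hidden size (illustrative)

# Single-stream: concatenate text and image tokens and run one
# Transformer encoder over the joint sequence (early fusion).
class SingleStream(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text, image):           # (B, Lt, d), (B, Lv, d)
        return self.encoder(torch.cat([text, image], dim=1))

# Dual-stream: keep the streams separate and exchange information
# through co-attention (each modality queries the other).
class DualStreamBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.t2v = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.v2t = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, text, image):
        text_out, _ = self.t2v(text, image, image)   # text attends to image
        image_out, _ = self.v2t(image, text, text)   # image attends to text
        return text + text_out, image + image_out

text = torch.randn(2, 12, d)            # dummy text features
image = torch.randn(2, 36, d)           # dummy region features
joint = SingleStream()(text, image)     # (2, 48, 256)
t, v = DualStreamBlock()(text, image)   # (2, 12, 256), (2, 36, 256)
```

The trade-off the quote alludes to is visible in the shapes: the single-stream model mixes modalities from the first layer onward, while the dual-stream block keeps per-modality sequences and fuses only through the cross-attention exchange.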
“…According to the knowledge injection scheme, existing studies can be classified into four categories: embedding combination [41,59], data structure compatibility [16,32,45], knowledge supervision [46,49], and neural-symbolic methods [2]. For VLP, knowledge can be acquired from both the image and text modalities, and several works [9,28,55] study how to integrate knowledge into their methods. ERNIE-ViL [55] built detailed semantic alignments between vision and language based on the scene graph parsed from the text.…”
Section: Related Work
confidence: 99%
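As a rough illustration of what "a scene graph parsed from the text" produces, here is a toy pattern-based extractor. The word lists and matching rules are invented for the example; ERNIE-ViL itself relies on a proper scene graph parser rather than anything this crude.

```python
# Toy text-to-scene-graph extraction; only illustrates the target
# structure, not a real parsing method. Lexicons are assumptions.
ATTRIBUTE_WORDS = {"black", "white", "wooden", "small"}
RELATION_WORDS = {"on", "under", "near", "beside"}
STOP_WORDS = {"a", "an", "the", "is", "sits"}

def parse_scene_graph(caption: str):
    words = caption.lower().split()
    objects, attributes, relationships = [], [], []
    for i, w in enumerate(words):
        if w in ATTRIBUTE_WORDS and i + 1 < len(words):
            attributes.append((words[i + 1], w))      # (object, attribute)
        elif w in RELATION_WORDS and objects:
            # crude heuristic: relate the last seen object to the final word
            relationships.append((objects[-1], w, words[-1]))
        elif w not in STOP_WORDS | ATTRIBUTE_WORDS | RELATION_WORDS:
            objects.append(w)
    return {"objects": objects,
            "attributes": attributes,
            "relationships": relationships}

print(parse_scene_graph("a black cat sits on the wooden table"))
# {'objects': ['cat', 'table'],
#  'attributes': [('cat', 'black'), ('table', 'wooden')],
#  'relationships': [('cat', 'on', 'table')]}
```

Triples of this form are what knowledge-enhanced VLP methods align against image regions, e.g., as masking targets in the Scene Graph Prediction sketch shown earlier.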
“…Recurrent Neural Networks (RNNs) were proposed to capture the temporal correlation of electrophysiological signals (Michielli et al., 2019). Attention mechanisms and attention-based feature fusion have been widely used in multimodal representation learning (Huang et al., 2019, 2020; Lu et al., 2019; Wei et al., 2020; Zhang et al., 2020a,b,c; Desai and Johnson, 2021; Yu et al., 2021; Ma et al., 2022). Existing studies based on attention mechanisms usually used single-modal data such as EEG or EOG, focusing only on the inter-relationships among single-modality features rather than on cross-modal features (Eldele et al., 2021).…”
Section: Introduction
confidence: 99%
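For the RNN-based temporal modeling this quote mentions, a minimal sketch might look like the following. The window length, channel count, and five-class head are assumptions for illustration, not details from the cited studies.

```python
import torch
import torch.nn as nn

# Minimal RNN over a 1-D electrophysiological signal window;
# shapes and the 5-class head are illustrative assumptions.
class SignalEncoder(nn.Module):
    def __init__(self, in_ch=1, hidden=64, n_classes=5):
        super().__init__()
        self.gru = nn.GRU(in_ch, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):             # x: (batch, time, channels)
        out, _ = self.gru(x)          # per-timestep features
        return self.head(out[:, -1])  # classify from the last step

x = torch.randn(8, 3000, 1)          # 8 windows of a 1-channel signal
logits = SignalEncoder()(x)          # (8, 5)
```

The recurrent state is what carries the temporal correlation across the window; the cross-modal extension the quote argues for would then fuse such per-modality encoders (e.g., EEG and EOG) with cross-attention, as in the dual-stream sketch above.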