“…For multi-modal fusion modules, existing methods can be classified into two categories (i.e., single-stream and dual-stream). Specifically, single-stream models [8,27,28,43] use a single Transformer for early and unconstrained fusion between modalities, whereas dual-stream models [35,47,55] adopt a co-attention mechanism to enable interaction between the two modalities. For pretext tasks, inspired by uni-modal pre-training schemes such as MLM [10,33] and causal language modeling [6], existing studies explore a variety of pre-training tasks, including MLM [27,35,47], MIM [8,35], ITM [27,58], image-text contrastive learning [26], and prefix language modeling [51].…”
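To make the single-stream/dual-stream distinction concrete, the following is a minimal PyTorch sketch (not the architecture of any specific cited model): in the single-stream case, image and text tokens are concatenated and fused by one shared Transformer, while in the dual-stream case each modality cross-attends to the other. The module names, dimensions, and layer configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Single-stream: concatenate image and text tokens, fuse with one Transformer."""
    def __init__(self, dim=768, depth=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens, txt_tokens):
        # Early, unconstrained fusion: self-attention sees both modalities at once.
        fused = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.encoder(fused)

class DualStreamFusion(nn.Module):
    """Dual-stream: each modality attends to the other via co-attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Co-attention: image tokens query text tokens, and vice versa.
        img_out, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        return img_out, txt_out

# Usage with dummy token sequences (batch=2, 50 image patches, 20 text tokens).
img = torch.randn(2, 50, 768)
txt = torch.randn(2, 20, 768)
print(SingleStreamFusion()(img, txt).shape)   # torch.Size([2, 70, 768])
i_out, t_out = DualStreamFusion()(img, txt)
print(i_out.shape, t_out.shape)               # [2, 50, 768] [2, 20, 768]
```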
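Among the listed pretext tasks, the image-text contrastive objective can be illustrated with a short sketch of a symmetric InfoNCE loss over a batch of pooled image and text embeddings; the function name, temperature value, and embedding dimensions are assumptions for illustration, not the formulation of any specific cited work.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs are positives, all other pairs in the batch are negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy pooled embeddings (batch=8, dim=256).
loss = image_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```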