Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.44
KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation

Abstract: We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We adapt the generative BART architecture (Lewis et al., 2020) to a multimodal model with visual and textual inputs. We further develop novel pretraining tasks to improve the model performance on the Visual Commonsense Generation (VCG) task. In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts model performance on the VCG task.
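The abstract describes adapting BART's sequence-to-sequence architecture to accept visual as well as textual inputs. As a rough illustration of that general pattern (not KM-BART's actual implementation), one common approach is to project detected image-region features into the model's embedding space and prepend them to the token embeddings; the class name, the 2048-dimensional Faster R-CNN feature size, and the linear projection below are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import BartModel

class MultimodalBartSketch(nn.Module):
    """Hypothetical sketch: prepend projected visual region features
    to BART's token embeddings. Not the actual KM-BART code."""

    def __init__(self, visual_dim=2048, model_name="facebook/bart-base"):
        super().__init__()
        self.bart = BartModel.from_pretrained(model_name)
        # Map detector region features (assumed 2048-d) to BART's hidden size.
        self.visual_proj = nn.Linear(visual_dim, self.bart.config.d_model)

    def forward(self, region_feats, input_ids, decoder_input_ids):
        # region_feats: (batch, n_regions, visual_dim), e.g. Faster R-CNN output
        tok_emb = self.bart.get_input_embeddings()(input_ids)
        vis_emb = self.visual_proj(region_feats)
        # Concatenate visual "tokens" in front of the text tokens.
        enc_inputs = torch.cat([vis_emb, tok_emb], dim=1)
        return self.bart(inputs_embeds=enc_inputs,
                         decoder_input_ids=decoder_input_ids)
```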

Cited by 25 publications (26 citation statements). References 23 publications.
“…Their proposed approach has a better understanding of noise and can handle complex queries. The authors of [79] performed visual commonsense generation with a model called Knowledge Enhanced Multimodal BART. The authors of [80] evaluated BART on knowledge-grounded conversation tasks and achieved good results.…”
Section: Types of Classification Algorithms
Confidence: 99%
“…Correspondingly, many general pre-training tasks have been proposed, such as Masked Language Modeling (MLM), Masked Region Modeling (MRM), and Image-Text Matching (ITM) (Yu et al., 2021). Besides, to make pre-trained models better understand downstream tasks, researchers have also designed task-specific pre-training models for different downstream tasks (Hao et al., 2020; Xing et al., 2021). In our work, apart from the popular general pre-training tasks, we also design three kinds of task-specific pre-training tasks for the MABSA task.…”
Section: Related Work
Confidence: 99%
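For readers unfamiliar with the general pre-training tasks named in this excerpt, below is a minimal sketch of the standard BERT-style Masked Language Modeling corruption; the 80/10/10 replacement split is the common convention from the literature, not a detail taken from the cited papers, and `mask_token_id` / `vocab_size` are assumed to come from the tokenizer:

```python
import torch

def mlm_corrupt(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Sketch of BERT-style MLM: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100  # cross-entropy ignores these positions

    ids = input_ids.clone()
    rnd = torch.rand(input_ids.shape)
    ids[selected & (rnd < 0.8)] = mask_token_id          # 80% -> [MASK]
    random_tok = torch.randint(vocab_size, input_ids.shape)
    replace = selected & (rnd >= 0.8) & (rnd < 0.9)
    ids[replace] = random_tok[replace]                   # 10% -> random token
    # remaining 10% of selected positions keep the original token
    return ids, labels
```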
“…Masked Region Modeling (MRM). Following Xing et al. (2021), our MRM task aims to predict the semantic class distribution of the masked regions. As shown in Figure 1, for the encoder input, we randomly mask image regions with a probability of 15% and replace them with zero vectors.…”
Section: Visual Pre-training
Confidence: 99%
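The MRM recipe quoted above (mask 15% of regions, replace them with zero vectors, predict the semantic class distribution) can be sketched as below. The KL-divergence objective against the object detector's class distribution is a common choice in the multimodal pre-training literature and is an assumption here, not a detail from the excerpt:

```python
import torch
import torch.nn.functional as F

def mask_regions(region_feats, mask_prob=0.15):
    """Zero out each image region with probability 15%, per the excerpt."""
    # region_feats: (batch, n_regions, dim)
    mask = torch.rand(region_feats.shape[:2],
                      device=region_feats.device) < mask_prob
    masked = region_feats.clone()
    masked[mask] = 0.0  # masked regions replaced with zero vectors
    return masked, mask

def mrm_kl_loss(pred_logits, detector_probs, mask):
    """Assumed objective: KL divergence between the model's predicted
    class distribution and the detector's, on masked regions only."""
    log_p = F.log_softmax(pred_logits[mask], dim=-1)
    return F.kl_div(log_p, detector_probs[mask], reduction="batchmean")
```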
“…This again suggests that the model is incapable of understanding complex relationships between vulnerable communities and ideas. An interesting future research avenue would be to explore methods that incorporate relevant knowledge bases into transformer models, similar to recent work on commonsense generation (Xing et al., 2021), to address these errors.…”
Section: False Positives and False Negatives
Confidence: 99%