2018
DOI: 10.1109/tpami.2017.2754246
FVQA: Fact-Based Visual Question Answering

Abstract: Visual Question Answering (VQA) has attracted much attention in both computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions wh…

Cited by 362 publications (312 citation statements)
References 46 publications (76 reference statements)
“…We trade off size in this case for knowledge and difficulty. We can see from the average question lengths and average answer lengths that our questions and answers are about comparable to KB-VQA [43] and FVQA [44] and longer than the other VQA datasets with the exception of DAQUAR and CLEVR (which are partially and fully automated from templates respectively). This makes sense since we would expect knowledge-based questions to be longer as they are typically not able to be as short as common questions in other datasets such as "How many objects are in the image?"…”
Section: Knowledge Categories (mentioning)
confidence: 83%
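The comparison in the statement above rests on average question and answer lengths counted in words. As a point of reference only, here is a minimal sketch of how such averages are typically computed; it is not taken from any cited paper, and the file name and the JSON schema with "question"/"answer" fields are assumptions.

import json

def average_lengths(path):
    # assumed schema: a JSON list of {"question": str, "answer": str} pairs
    with open(path) as f:
        examples = json.load(f)
    q_lens = [len(ex["question"].split()) for ex in examples]
    a_lens = [len(ex["answer"].split()) for ex in examples]
    return sum(q_lens) / len(q_lens), sum(a_lens) / len(a_lens)

# hypothetical file of question-answer pairs
avg_q, avg_a = average_lengths("qa_pairs.json")
print(f"average question length: {avg_q:.1f} words; "
      f"average answer length: {avg_a:.1f} words")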
“…In the top section, we look at a number of datasets which do not explicitly try to include a knowledge component, including the ubiquitous VQAv2 dataset [16], the first version of which was one of the first datasets to investigate visual question answering. Compared to these datasets, we have a comparable number of questions to DAQUAR [32] as well as MovieQA [41], and many more questions than knowledge-based datasets KB-VQA [43] and FVQA [44]. We have fewer questions compared to CLEVR [22], where the images, questions and answers are automatically generated, as well as compared to more large-scale human-annotated visual datasets such as VQAv2 [16], and Visual…”
[Figure 3: Breakdown of questions in terms of knowledge categories.]
Section: Dataset Statistics (mentioning)
confidence: 99%
“…Visual7w [12] is a dataset with the goal of providing semantic links between textual descriptions and image regions by means of object-level grounding. FVQA [13] primarily contains questions that require external information to answer.…”
Section: Related Work, A. VQA Datasets (mentioning)
confidence: 99%
“…Several researchers employed commonsense knowledge to enrich high-level understanding tasks such as visual question answering…”
[Figure 2: (a) Example of questions that require explicit external knowledge [35]; (b) Example where knowledge helps [37]; (c) Ways to integrate background knowledge: i) Pre-process knowledge and augment input [1]; ii) Incorporate knowledge as embeddings [36]; iii) Post-processing using explicit reasoning mechanism [2]; iv) Using knowledge graph to influence NN architecture [24].]
Section: High-level Common-sense Knowledge (mentioning)
confidence: 99%
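Of the four integration strategies listed in the figure caption above, option ii), incorporating knowledge as embeddings, is perhaps the simplest to picture. The following is a minimal, hypothetical PyTorch sketch, not the architecture of any cited paper: image features, an encoded question, and the embedding of a retrieved supporting fact are concatenated and mapped to answer logits. All class and parameter names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class KnowledgeAugmentedVQA(nn.Module):
    """Toy fusion model: concatenate image, question and fact embeddings."""
    def __init__(self, img_dim=2048, q_dim=512, fact_dim=300,
                 hidden=1024, num_answers=3000):
        super().__init__()
        # dimensions are illustrative defaults, not values from any paper
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim + fact_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, img_feat, q_feat, fact_emb):
        # img_feat: CNN image features; q_feat: encoded question;
        # fact_emb: embedding of a retrieved supporting fact (e.g. a KB triple)
        fused = torch.cat([img_feat, q_feat, fact_emb], dim=-1)
        return self.classifier(fused)  # logits over a fixed answer vocabulary

# usage with random tensors standing in for real features
model = KnowledgeAugmentedVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 3000])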