2022
DOI: 10.48550/arxiv.2210.08554
Preprint

COFAR: Commonsense and Factual Reasoning in Image Search

Abstract: We present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge and leverages them, along with the natural language query, to ground relevant knowledge. Further, KRAMT seamlessly integrates visual content and grounded knowledge to learn alignment between images and search queries. This unified framework is then used to perform image search requiring commonsense and factual reasoning. The retriev…
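The pipeline the abstract describes (detect named visual entities, retrieve encyclopedic facts about them, fuse the grounded facts with the visual content to score image-query alignment) can be sketched as a toy. This is an illustrative assumption, not the paper's implementation: the knowledge base, entity lists, and word-overlap scoring below stand in for KRAMT's knowledge retriever and multimodal transformer.

```python
# Toy sketch of knowledge-augmented image search (hypothetical data and
# scoring; KRAMT itself uses a multimodal transformer, not word overlap).

def tokens(text):
    """Crude tokenizer: lowercase, drop question marks, split on whitespace."""
    return set(text.lower().replace("?", "").split())

# Hypothetical encyclopedic knowledge base keyed by named visual entities.
KNOWLEDGE = {
    "taj mahal": "white marble mausoleum in agra built by shah jahan",
    "starbucks": "american coffee chain serving espresso drinks",
}

def ground_knowledge(entities, query):
    """Pick the fact most relevant to the query among the image's entities."""
    best, best_overlap = "", 0
    for ent in entities:
        fact = KNOWLEDGE.get(ent, "")
        overlap = len(tokens(fact) & tokens(query))
        if overlap > best_overlap:
            best, best_overlap = fact, overlap
    return best

def score(image, query):
    """Alignment score: caption-query overlap plus grounded-knowledge overlap."""
    q = tokens(query)
    visual = len(tokens(image["caption"]) & q)
    knowledge = len(tokens(ground_knowledge(image["entities"], query)) & q)
    return visual + knowledge

images = [
    {"caption": "a crowd in front of taj mahal", "entities": ["taj mahal"]},
    {"caption": "people outside starbucks", "entities": ["starbucks"]},
]

query = "where can I buy coffee"
ranked = sorted(images, key=lambda im: score(im, query), reverse=True)
print(ranked[0]["caption"])  # the coffeehouse image should rank first
```

The point of the sketch is the gateway idea: neither caption mentions coffee, so a caption-only matcher cannot answer the query; only the knowledge grounded via the named entity makes the second image retrievable.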

Cited by 1 publication (1 citation statement) · References 24 publications (39 reference statements)
“…[the quote opens mid-table; leading fragment preserved as extracted:] WinoX [39] French, German, Russian [83] answering 150,000 questions

| Name | Task | Size | Construction |
| COFAR [47] | Find an image matching a query | 25,300 images; 40,800 queries | Expert construction |
| CoSim [74] | Counterfactual reasoning about images | 3500 instances | Crowd sourcing |
| CRIC [44] | Compositional reasoning | 96,000 images; 494,000 questions | Synthesized |
| e-SNLI-VE [71] | Visual-textual entailment | 430,000 | Synthesized from SNLI-VE |
| FVQA [138] | Visual question answering | 2190 images | Synthesized |
| GD-VCR [149] | Visual question answering | 328 images; 886 Q/A pairs | Expert construction |
| Half&Half [123] | Reasoning with text and incomplete images | 126,000 examples | Synthesized |
| HumanCog [151] | Who in image is being described? | 67,000 images; 138,000 descriptions | Extracted from VCR + crowd sourcing |
| HVQR [21] | Visual question answering | 33,000 images; 157,000 Q/A pairs | Synthesized |
| IconQA [94] | Visual question answering | 107,400 instances | Crowd sourcing |
| KB-VQA [137] | Visual question answering | 2190 images | Synthesized |
| Naive action-effect prediction [45] | Match image to effect of action | 1400 text effects; 4163 images | Crowd sourcing |
| PTR [61] | Visual question answering | 80,000 images; 800,000 Q/A pairs | Synthesized (both images and Q/A pairs) |
| Sherlock [60] | Inferences from images | 103,000 images; 363,000 inferences | Crowd sourcing |
| VCR [155] | Visual question answering | 290,000 questions | Crowd sourcing |
| Visual Genome [78] | Visual question answering | 108,000 images | Crowd sourcing |
| WinoGAViL [13] | Match image to text | 4482 examples | Gamification |

Table 8: Image benchmarks. [The quote continues into a second table with the same columns:] AGENT [121] Is this surprising?…”
Section: Original (mentioning); confidence: 99%