2016
DOI: 10.48550/arxiv.1602.07332
Preprint

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Cited by 187 publications (201 citation statements)
References 0 publications
“…The proposed method can be improved in various ways. For one, RWFNs can be employed in tasks that should extract structural knowledge from images as well as text, such as visual question answering using the Visual Genome dataset (Krishna et al. 2016). Moreover, other perspectives from neuroscience may lead to biologically plausible learning algorithms that might apply to further optimizations of RWFNs (Krotov and Hopfield 2019; Kasai et al. 2021; Kappel et al. 2018).…”
Section: Discussion
confidence: 99%
“…Visual Question Answering (VQA) The conventional visual question answering (VQA) task aims to answer questions pertaining to a given image. Multiple VQA datasets have been proposed, such as Visual Genome QA [25], VQA [2], GQA [16], CLEVR [22], MovieQA [53], and so on. Many works have shown state-of-the-art performance on VQA tasks, including task-specific VQA models with various cross-modality fusion mechanisms [13,20,24,49,62,66,67] and joint vision-language models that are pretrained on large-scale vision-language corpora and fine-tuned on VQA tasks [6,11,29,30,33,52,68].…”
Section: Related Work
confidence: 99%
“…Object detectors, such as Faster R-CNN (Ren et al., 2015) and Bottom-Up and Top-Down Attention (BUTD) (Anderson et al., 2018), are trained on image annotations of common objects, e.g. COCO (Lin et al., 2014) (100K images) and Visual Genome (Krishna et al., 2016) (100K images). VinVL has achieved SoTA performance on many V+L tasks by utilizing a powerful object detector pre-trained on a very large collection of image annotations (2.5M images).…”
Section: Related Work
confidence: 99%
“…Following UNITER and other existing work, we construct our pre-training data using two in-domain datasets, COCO (Lin et al, 2014) and Visual Genome (VG) (Krishna et al, 2016), and two out-of-domain datasets, SBU Captions (Ordonez et al, 2011) and Conceptual Captions (CC) (Sharma et al, 2018). The total number of unique images is 4.0M, and the number of image-text pairs is 5.1M.…”
Section: Pre-training Datasets
confidence: 99%