2016
DOI: 10.48550/arxiv.1602.07332
Preprint

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Cited by 187 publications (201 citation statements)
References 0 publications
“…The proposed method can be improved in various ways. For one, RWFNs can be employed in tasks that should extract structural knowledge from images as well as text, such as visual question answering using the Visual Genome dataset (Krishna et al. 2016). Moreover, other perspectives from neuroscience may lead to biologically plausible learning algorithms that might apply to further optimizations of RWFNs (Krotov and Hopfield 2019; Kasai et al. 2021; Kappel et al. 2018).…”
Section: Discussion
confidence: 99%
“…Visual Question Answering (VQA) The conventional visual question answering (VQA) task aims to answer questions pertaining to a given image. Multiple VQA datasets have been proposed, such as Visual Genome QA [25], VQA [2], GQA [16], CLEVR [22], MovieQA [53], and so on. Many works have shown state-of-the-art performance on VQA tasks, including task-specific VQA models with various cross-modality fusion mechanisms [13,20,24,49,62,66,67] and joint vision-language models that are pretrained on large-scale vision-language corpora and fine-tuned on VQA tasks [6,11,29,30,33,52,68].…”
Section: Related Work
confidence: 99%
“…Object detectors, such as Faster R-CNN (Ren et al., 2015) and Bottom-Up and Top-Down Attention (BUTD) (Anderson et al., 2018), are trained on image annotations of common objects, e.g. COCO (Lin et al., 2014) (100K images) and Visual Genome (Krishna et al., 2016) (100K images). VinVL has achieved SoTA performance on many V+L tasks by utilizing a powerful object detector pre-trained on a very large collection of image annotations (2.5M images).…”
Section: Related Work
confidence: 99%
“…Following UNITER and other existing work, we construct our pre-training data using two in-domain datasets, COCO (Lin et al, 2014) and Visual Genome (VG) (Krishna et al, 2016), and two out-of-domain datasets, SBU Captions (Ordonez et al, 2011) and Conceptual Captions (CC) (Sharma et al, 2018). The total number of unique images is 4.0M, and the number of image-text pairs is 5.1M.…”
Section: Pre-training Datasets
confidence: 99%