2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00207

Scene Graph Generation With External Knowledge and Image Reconstruction

Abstract: Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, or often come with noisy and missing annotations, which makes the development of a reliable scene graph prediction model very challenging. In this paper, we propose a novel scene graph generation algorithm with external knowledge and image reconstruction lo…

Cited by 293 publications (205 citation statements); citing publications span 2019 to 2023. References 41 publications.
“…Vision and language are two important aspects of human intelligence to understand the real world. A large amount of research [5,9,23] has been done to bridge these two modalities. Image-text matching is one of the fundamental topics in this field, which refers to measuring the visual-semantic similarity between a sentence and an image.…”
Section: Introduction (mentioning)
confidence: 99%
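The visual-semantic similarity this excerpt mentions is commonly computed by projecting both modalities into a joint embedding space and taking a cosine similarity. Below is a minimal sketch of that idea; the JointEmbedding class, the feature dimensions, and the random inputs are illustrative assumptions, not code from the cited work.

```python
# Minimal sketch of visual-semantic similarity for image-text matching.
# Real systems feed in features from a trained CNN image encoder and an
# RNN/Transformer sentence encoder; here random tensors stand in for both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # projects image features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # projects sentence features

    def forward(self, img_feat, txt_feat):
        # L2-normalize so the dot product below is cosine similarity
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v @ t.t()  # (num_images, num_sentences) similarity matrix

model = JointEmbedding()
sim = model(torch.randn(4, 2048), torch.randn(4, 300))
print(sim.shape)  # torch.Size([4, 4]); the diagonal holds matched pairs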
“…Through our experiments on Visual Genome [30], a dataset containing visual relationship data, we show that the object representations generated by the predicate functions result in meaningful features that can be used to enable few-shot scene graph prediction, exceeding existing transfer learning approaches by 4.16 at recall@1 with 5 labelled examples. We further justify our design decisions by demonstrating that our scene graph model performs on par with existing state-of-the-art models and even outperforms models that also do not utilize external knowledge bases [18], linguistic priors [39,58], or rely on complicated pre- and post-processing heuristics [58,6]. We run ablations where we remove the semantic or spatial components of our functions and demonstrate that both components lead to increased performance, but the semantic component is responsible for most of the performance.…”
Section: Introduction (mentioning)
confidence: 88%
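For readers unfamiliar with the recall@K metric referenced in this excerpt: it measures the fraction of ground-truth (subject, predicate, object) triplets that appear among a model's K highest-scoring predictions. The following is a hedged sketch under that assumption; the function name, triplet format, and example data are illustrative, not the cited papers' evaluation code.

```python
# Hedged sketch of recall@K for relationship / scene graph prediction:
# the share of ground-truth triplets recovered in the top-K predictions.
def recall_at_k(pred_triplets, scores, gt_triplets, k):
    """pred_triplets: list of (subj, pred, obj); scores: parallel floats;
    gt_triplets: set of ground-truth (subj, pred, obj) triplets."""
    ranked = [t for _, t in sorted(zip(scores, pred_triplets),
                                   key=lambda x: x[0], reverse=True)]
    top_k = set(ranked[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

preds = [("man", "riding", "horse"), ("man", "wearing", "hat"), ("horse", "on", "grass")]
scores = [0.9, 0.4, 0.7]
gt = {("man", "riding", "horse"), ("horse", "on", "grass")}
print(recall_at_k(preds, scores, gt, k=1))  # 0.5: only the top-1 prediction matches
```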
“…This includes Iterative Message Passing (IMP) [55], Multi-level Scene Description Network (MSDN) [35], ViP-CNN [33], and MotifNet-freq [58]. The second category includes models such as Factorizable Net [34], KB-GAN [18], and MotifNet [58], which use linguistic priors in the form of word vectors or external information from knowledge bases, while MotifNet also deploys a custom-trained object detector, class-conditioned non-maximum suppression, and heuristically removes all object pairs that do not overlap. While not comparable, we report their numbers for clarity.…”
Section: Baselines (mentioning)
confidence: 99%
“…Scene segmentation (also called scene parsing or semantic segmentation) is one of the fundamental problems in computer vision and has drawn much attention. Recently, thanks to the great success of Convolutional Neural Networks (CNNs) in computer vision [42,68,71,52,25,72,27,80,26], many CNN-based segmentation works have been proposed and have achieved great progress [29,22,81,83,84,70,60]. For example, Long et al. [54] introduce fully convolutional networks (FCN), in which the fully connected layers in standard CNNs are transformed into convolutional layers.…”
Section: Related Work 2.1 Scene Segmentation (mentioning)
confidence: 99%
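The FC-to-convolution transformation described in this last excerpt can be made concrete with a short PyTorch sketch: a fully connected classifier head is rewritten as a convolution carrying the same weights, so the network accepts larger inputs and emits a dense spatial score map. The layer sizes below are illustrative assumptions, not those of Long et al.'s actual FCN.

```python
# Minimal sketch of the FCN idea: re-express a fully connected head as a
# convolution so the network outputs a score map instead of a single vector.
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 10)           # classifier over 7x7x512 features
conv = nn.Conv2d(512, 10, kernel_size=7)  # equivalent convolutional form

# Copy the FC weights into the conv kernel: same parameters, new shape.
conv.weight.data = fc.weight.data.view(10, 512, 7, 7)
conv.bias.data = fc.bias.data

feat = torch.randn(1, 512, 7, 7)
out_fc = fc(feat.flatten(1))   # (1, 10) single prediction
out_conv = conv(feat)          # (1, 10, 1, 1) identical class scores
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True

# On a larger feature map the conv head yields a dense grid of scores:
print(conv(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 10, 8, 8])
```

Because the two heads share parameters exactly, their outputs agree on a 7x7 feature map, while a larger map produces an 8x8 grid of per-location class scores, which is what makes dense segmentation possible.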