ViP-CNN: Visual Phrase Guided Convolutional Neural Network

Li, Yikang; Ouyang, Wanli; Wang, Xiaogang; Tang, Xiaoou

doi:10.1109/cvpr.2017.766

Cited by 227 publications

(183 citation statements)

References 58 publications

(115 reference statements)

Supporting

Mentioning

180

Contrasting


“…The Visual Genome (VG) dataset is one of the largest relationship detection datasets. We note that there are multiple versions of VG datasets [20,33,34,37]. In this paper, we use the pruned version of the VG dataset provided by [37].…”

Section: Datasets Evaluation Tasks and Metrics (mentioning)

confidence: 99%

On Exploring Undetermined Relationships for Visual Relationship Detection

Zhan et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


In visual relationship detection, human-notated relationships can be regarded as determinate relationships. However, there are still large amount of unlabeled data, such as object pairs with less significant relationships or even with no relationships. We refer to these unlabeled but potentially useful data as undetermined relationships. Although a vast body of literature exists, few methods exploit these undetermined relationships for visual relationship detection. In this paper, we explore the beneficial effect of undetermined relationships on visual relationship detection. We propose a novel multi-modal feature based undetermined relationship learning network (MF-URLN) and achieve great improvements in relationship detection. In detail, our MF-URLN automatically generates undetermined relationships by comparing object pairs with human-notated data according to a designed criterion. Then, the MF-URLN extracts and fuses features of object pairs from three complementary modals: visual, spatial, and linguistic modals. Further, the MF-URLN proposes two correlated subnetworks: one subnetwork decides the determinate confidence, and the other predicts the relationships. We evaluate the MF-URLN on two datasets: the Visual Relationship Detection (VRD) and the Visual Genome (VG) datasets. The experimental results compared with state-of-the-art methods verify the significant improvements made by the undetermined relationships, e.g., the top-50 relation detection recall improves from 19.5% to 23.9% on the VRD dataset.


“…In the object-pairs proposing stage, [16] proposes a triplet proposal with NMS, based on the product of objectiveness scores, to remove redundant object-pairs. However, there exists a gap between higher objectiveness scores and more meaningful object-pairs obviously.…”

Section: Related Work (mentioning)

confidence: 99%

“…where P is the objectiveness score from object detection module. Inspired by greedy NMS [11] and triplet NMS [16], shown in Algorithm 1, object-pairs proposing scheme is based on rating scores and improved NMS (i-NMS).…”

Section: Rating Scores and i-NMS Based Object-pair Proposing (mentioning)

confidence: 99%

“…However, this kind of object-pairs proposing scheme faces one big challenge: how to select M reasonable object-pairs from N^2 possible combinations? Some works [16,40] attempt to reserve specific object-pairs proposals based on the objectiveness scores from detection model. But there is a large gap between the higher objectiveness scores and the more meaningful object-pairs, which deteriorates performances inevitably in this proposing scheme.…”

Section: Introduction (mentioning)

confidence: 99%


Visual Relationship Detection with Relative Location Mining

Zhou, Zhang 2019

Proceedings of the 27th ACM International Conference on Multimedia


Visual relationship detection, as a challenging task used to find and distinguish the interactions between object pairs in one image, has received much attention recently. In this work, we propose a novel visual relationship detection framework by deeply mining and utilizing relative location of object-pair in every stage of the procedure. In both the stages, relative location information of each object-pair is abstracted and encoded as auxiliary feature to improve the distinguishing capability of object-pairs proposing and predicate recognition, respectively; Moreover, one Gated Graph Neural Network (GGNN) is introduced to mine and measure the relevance of predicates using relative location. With the location-based GGNN, those non-exclusive predicates with similar spatial position can be clustered firstly and then be smoothed with close classification scores, thus the accuracy of top n recall can be increased further. Experiments on two widely used datasets VRD and VG show that, with the deeply mining and exploiting of relative location information, our proposed model significantly outperforms the current state-of-the-art.
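
The citation statements above describe proposing object pairs by ranking them on the product of their objectiveness scores and removing redundant pairs with NMS. The following is a minimal sketch of that general idea, not the implementation of either paper. The function names (`iou`, `propose_pairs`), the pair-overlap criterion (product of the subject IoU and the object IoU), and the threshold value are all assumptions made for illustration.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def propose_pairs(boxes, scores, max_pairs, overlap_thresh=0.25):
    """Greedy NMS over ordered (subject, object) index pairs.

    Pair score: product of the two per-box objectiveness scores.
    Pair overlap (an assumption of this sketch): IoU of the two
    subject boxes multiplied by IoU of the two object boxes.
    """
    # All N * (N - 1) ordered pairs, highest pair score first.
    candidates = sorted(
        permutations(range(len(boxes)), 2),
        key=lambda p: scores[p[0]] * scores[p[1]],
        reverse=True,
    )
    kept = []
    for s, o in candidates:
        if len(kept) == max_pairs:
            break
        # Drop the pair if it nearly duplicates an already kept pair.
        redundant = any(
            iou(boxes[s], boxes[ks]) * iou(boxes[o], boxes[ko]) > overlap_thresh
            for ks, ko in kept
        )
        if not redundant:
            kept.append((s, o))
    return kept
```

Because the ranking depends only on detector confidence, a sketch like this also makes the quoted criticism concrete: two confidently detected but unrelated objects outrank a less confident pair that actually interacts.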


“…The task of Visual Relationship Detection has been the main focus of several recent works (Lu et al, 2016; Li et al, 2017a; Zhang et al, 2017a; Dai et al, 2017; Hu et al, 2017; Liang et al, 2017; Yin et al, 2018). The goal is to detect a generic <subject, predicate, object> triplet present in an image.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Relate from Captions and Bounding Boxes

Garg, Moniz, Aviral et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics


In this work, we propose a novel approach that predicts the relationships between various entities in an image in a weakly supervised manner by relying on image captions and object bounding box annotations as the sole source of supervision. Our proposed approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverage the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset by achieving a recall@50 of 15% and recall@100 of 25% on the relationships present in the image. We also show that the model successfully predicts relations that are not present in the corresponding captions.
