2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.454

PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN

Abstract: We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation ground truths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied problem, Weakly Supervised Object Detection (WSOD), WSVRD is more challenging as it needs to examine a large set of re…
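The supervision gap the abstract describes can be pictured with a short hypothetical sketch of the two annotation levels (field names and values are illustrative, not taken from the paper): instance-level supervision ties each triplet to bounding boxes, while the image-level supervision assumed by WSVRD only lists the triplets.

```python
# Instance-level annotation: each relation is grounded to boxes (expensive to label).
instance_level = [
    {"subject": ("person", [48, 20, 110, 200]),   # (class, [x1, y1, x2, y2])
     "predicate": "ride",
     "object": ("horse", [30, 90, 180, 240])},
]

# Image-level annotation (the WSVRD setting): only the triplets, no boxes.
image_level = [("person", "ride", "horse")]
```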

Cited by 131 publications (109 citation statements). References 44 publications.
“…To our best knowledge, it is the only work on unsupervised referring expression grounding. Note that it is also known as "weakly supervised" detection [60] as there is still image-level ground truth (i.e., the referring expression). Table 3 reports the unsupervised results on the RefCLEF.…”
Section: Evaluations of Unsupervised Grounding (mentioning, confidence: 99%)
“…In [30], an end-to-end system exploits the interaction of visual and geometric features of the subject, object and predicate. The end-to-end system in [34] exploits weakly supervised learning (i.e., the supervision is at image level). LTNs exploit the combination of the visual/geometric features of the subject/object with additional background knowledge.…”
Section: Related Work (mentioning, confidence: 99%)
“…The whole dataset is split into 73,801 images for training and 25,857 images for testing. [Flattened table rows comparing [41], Shuffle [38], VSA-Net [12], and PPR-FCN [42] omitted; the column structure is not recoverable.] We compare our complete model, denoted "RLM (ours)", with some existing methods.…”
Section: Experiments on Visual Genome (mentioning, confidence: 99%)
“…Visual relationship detection can be divided into two stages, including an object-pairs proposing stage and a predicate recognition stage. Traditional methods [22,42] follow the simple framework: given N detected objects, N² object-pairs are proposed in the object-pairs proposing stage. The main problem is that the performance of relationship models is heavily dependent on N.…”
Section: Introduction (mentioning, confidence: 99%)
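For context, the quadratic pairing this quote refers to can be sketched as follows (a hypothetical enumeration, not the actual pipeline of [22] or [42]): every ordered subject/object pair of the N detections is proposed, so the predicate stage must score roughly N² candidates, which is why the cost grows quickly with N.

```python
from itertools import permutations

def propose_pairs(detections):
    """Enumerate ordered (subject, object) candidates from N detections.

    `detections` is a list of per-object records (e.g. box + class + score).
    Excluding self-pairs yields N*(N-1) proposals; including them gives N^2,
    which is why relation models scale poorly with the number of detections.
    """
    return list(permutations(detections, 2))

# e.g. 100 detections -> 9,900 ordered pairs to score for predicates
```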