Learning to Detect Human-Object Interactions

Chao, Yu-Wei; Liu, Yunfan; Liu, Xieyang; Zeng, Huayi; Deng, Jia

doi:10.48550/arxiv.1702.05448

Cited by 17 publications

(28 citation statements)

References 28 publications


“…Visual Relationship Detection. Relationship detection among image constituents uses separate branches in a ConvNet to model objects, humans, and their interactions [5,21]. A distinct approach in Santoro et al [60] treats each of the cells across channels in convolutional feature maps as an object and the relationships are modeled by a pairwise concatenation of the feature representations of individual cells.…”

Section: Related Work (mentioning)

confidence: 99%
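The pairing step described in the excerpt above can be sketched in a few lines. This is a minimal illustration in plain Python, not the cited authors' implementation: the function names, the nested-list layout `[channel][row][col]`, and the choice to form all ordered pairs (including a cell paired with itself) are assumptions made for the example. Only the pairing is shown; whatever model consumes the pairs is outside the excerpt.

```python
from itertools import product


def cells_as_objects(feature_map):
    """Treat each spatial cell as one "object".

    feature_map is a nested list indexed [channel][row][col] (an assumed
    layout). The result is one vector per cell, holding that cell's value
    in every channel.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def pairwise_concat(objects):
    """Concatenate the feature vectors of every ordered pair of objects."""
    return [a + b for a, b in product(objects, repeat=2)]


# Toy feature map: 2 channels, 2x2 spatial grid.
toy_map = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]

objects = cells_as_objects(toy_map)   # 4 cells, each a 2-dim vector
pairs = pairwise_concat(objects)      # 4 * 4 ordered pairs, each 4-dim
print(len(objects), len(pairs), pairs[1])
```

With H x W cells and C channels, this produces (H x W)^2 pair vectors of length 2C, which is why the number of pairs grows quickly with the spatial size of the feature map.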

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

Xie, Lai, Doran et al. 2019

Preprint

107


Existing visual reasoning datasets such as Visual Question Answering (VQA) often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE), consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/necla-ml/SNLI-VE.



“…More recently, with the release of large datasets like HICO (Chao et al 2015), Visual Genome (Krishna et al 2017), HCVRD (Zhuang et al 2017b), V-COCO (Gupta and Malik 2015), and HICO-Det (Chao et al 2017), the problem of detecting and recognizing HOIs has attracted significant attention. This has been driven by HICO which is a benchmark dataset for recognizing human-object interactions.…”

Section: Related Work (mentioning)

confidence: 99%

“…These atomic recognition tasks are certainly the building blocks of a variety of approaches for HOI understanding (Delaitre, Sivic, and Laptev 2011); and the progress in these atomic tasks directly translates to improvements in HOI understanding. However, the task of HOI understanding comes with its own unique set of challenges (Lu et al 2016; Chao et al 2017).…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Human-Object Interactions via Functional Generalization

Bansal, Rambhatla, Shrivastava et al. 2019

Preprint


We present an approach for detecting human-object interactions (HOIs) in images, based on the idea that humans interact with functionally similar objects in a similar manner. The proposed model is simple and efficiently uses the data, visual features of the human, relative spatial orientation of the human and the object, and the knowledge that functionally similar objects take part in similar interactions with humans. We provide extensive experimental validation for our approach and demonstrate state-of-the-art results for HOI detection. On the HICO-Det dataset our method achieves a gain of over 2.5% absolute points in mean average precision (mAP) over the state of the art. We also show that our approach leads to significant performance gains for zero-shot HOI detection in the seen object setting. We further demonstrate that using a generic object detector, our model can generalize to interactions involving previously unseen objects. HICO-Det dataset (Chao et al. 2017) with 80 unique object classes and 117 predicates, there are 9,360 possible relationships. This number increases to more than 10^6 for
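The figure of 9,360 possible relationships quoted above follows from pairing every object class with every predicate. A minimal sketch of that arithmetic is given here; the constant and function names are hypothetical, and the two counts are the ones stated in the text.

```python
NUM_OBJECT_CLASSES = 80   # object classes, as stated in the text
NUM_PREDICATES = 117      # predicates, as stated in the text


def num_possible_relationships(num_objects: int, num_predicates: int) -> int:
    """Count every (predicate, object) pairing for a human subject."""
    return num_objects * num_predicates


total = num_possible_relationships(NUM_OBJECT_CLASSES, NUM_PREDICATES)
print(total)  # 9360
```

Because the count is a product, adding object classes or predicates enlarges the label space multiplicatively, which is consistent with the much larger number the text mentions next.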


“…The task of recognizing objects and their relationships has been investigated by numerous studies in various forms. This includes detection of human-object interactions [7,3], localization of proposals from natural language expressions [12], or the more general tasks of visual relationship detection [17,25,38,5,19,37,34,41] and scene graph generation [33,18,35,22].…”

Section: Relationship Detection (mentioning)

confidence: 99%

“…The results show that our relational embedding represents inter-dependency among all object instances, being consistent with the ground-truth relationships. To illustrate, in the first example, the ground-truth matrix refers to the relationships between the 'man' (1) and his body parts (2, 3), and between the 'mountain' (0) and the 'rocks' (4, 5, 6, 7), which are also reasonably captured in our relational embedding matrix. Note that our model infers relationships correctly even where ground-truth annotations are missing, such as cell (7, 0), due to the sparsity of annotations in the Visual Genome dataset.…”

Section: Qualitative Evaluation (mentioning)

confidence: 99%

LinkNet: Relational Embedding for Scene Graph

Woo, Kim, Cho et al. 2018

Preprint


Objects and their relationships are critical contents for image understanding. A scene graph provides a structured description that captures these properties of an image. However, reasoning about the relationships between objects is very challenging and only a few recent works have attempted to solve the problem of generating a scene graph from an image. In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among all object instances. We design a simple and effective relational embedding module that enables our model to jointly represent connections among all related objects, rather than focus on an object in isolation. Our method significantly benefits the main part of the scene graph generation task: relationship classification. Using it on top of a basic Faster R-CNN, our model achieves state-of-the-art results on the Visual Genome benchmark. We further push the performance by introducing a global context encoding module and a geometrical layout encoding module. We validate our final model, LinkNet, through extensive ablation studies, demonstrating its efficacy in scene graph generation.
