2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01027
QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information

Cited by 136 publications (157 citation statements)
References 22 publications
“…Moreover, one-stage methods [39] have also appeared that directly detect HOI triplets. Besides works based on convolutional neural networks (CNNs), Transformer-based methods [40], [41] have recently been proposed and achieve decent improvements.…”
Section: Related Work
confidence: 99%
“…In Tab. 1, on the challenging HICO-DET [1], the upper bounds are 45.52 mAP (+QPIC [41], detection [41]) and 62.65 mAP (GT human-object boxes), which are significantly superior to the state of the art (about 29 mAP [41] and 44 mAP [38]). Here, detection [41] indicates using the detected human-object boxes from [41].…”
Section: Analyzing the Upper Bound of HAKE
confidence: 99%
“…One-stage methods [20][21][22] execute object detection and HOI detection concurrently and pair them afterwards. Recent studies [23][24][25][26] achieve end-to-end HOI detection with a DETR [27] style network and benefit from the wider perception field of transformers [26].…”
Section: Related Work
confidence: 99%
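The excerpt above describes the DETR-style, query-based design behind QPIC and related one-stage detectors: a set of learned queries attends over image-wide encoder features, and each query is decoded directly into a ⟨human, object, verb⟩ triplet. The following is a minimal, hypothetical PyTorch sketch of such a head; the module name, head layout, and class counts are illustrative assumptions, not QPIC's or any cited paper's actual implementation.

```python
# Minimal sketch of a DETR-style query-based HOI head.
# All names (HOIDecoderHead, num_queries, class counts) are
# illustrative assumptions, not the actual QPIC implementation.
import torch
import torch.nn as nn

class HOIDecoderHead(nn.Module):
    def __init__(self, hidden_dim=256, num_queries=100,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        # Learnable queries; each one is decoded into a full
        # <human box, object box, object class, verb> triplet.
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Per-query prediction heads: the human and object boxes are
        # predicted jointly by the same query (pairwise detection).
        self.human_box = nn.Linear(hidden_dim, 4)    # (cx, cy, w, h)
        self.object_box = nn.Linear(hidden_dim, 4)
        self.object_cls = nn.Linear(hidden_dim, num_obj_classes + 1)  # +1: "no object"
        self.verb_cls = nn.Linear(hidden_dim, num_verb_classes)

    def forward(self, memory):
        # memory: flattened image features from an encoder,
        # shape (H*W, batch, hidden_dim); queries attend image-wide.
        bsz = memory.size(1)
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, bsz, 1)
        hs = self.decoder(tgt, memory)  # (num_queries, batch, hidden_dim)
        return {
            "human_boxes": self.human_box(hs).sigmoid(),
            "object_boxes": self.object_box(hs).sigmoid(),
            "object_logits": self.object_cls(hs),
            "verb_logits": self.verb_cls(hs),
        }
```

During training, a DETR-style bipartite (Hungarian) matching would pair these per-query predictions with ground-truth triplets before computing the losses, which is what makes the pipeline end-to-end without a separate pairing stage.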
“…In order to put the proposed VSM and CMC into practice, we select an end-to-end vision model (VM) (Zou et al. 2021; Tamura, Ohashi, and Yoshinaga 2021) and compose the Object-guided Cross-modal Calibration Network (OCN). To conclude, our contributions are three-fold:…”
Section: Introduction
confidence: 99%