2022
DOI: 10.48550/arxiv.2202.00259
Preprint

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Abstract: Human-Object Interaction (HOI) detection is an essential task to understand human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses two-stage methods' merit: object-guided hierarchy. The object in one HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a …
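The idea that the object in an HOI triplet hints at the verb can be illustrated with a small sketch. The abstract is truncated before it describes the actual model, so the code below is not the paper's method: it assumes a simple co-occurrence estimate of P(verb | object) with additive smoothing, and the names `build_verb_priors` and `calibrate` are hypothetical.

```python
from collections import Counter, defaultdict


def build_verb_priors(pairs, verbs, smoothing=1.0):
    """Estimate P(verb | object) from annotated (verb, object) pairs.

    Assumption: plain co-occurrence counts with additive smoothing,
    chosen only for illustration.
    """
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1

    priors = {}
    for obj, verb_counts in counts.items():
        total = sum(verb_counts.values()) + smoothing * len(verbs)
        priors[obj] = {v: (verb_counts[v] + smoothing) / total for v in verbs}
    return priors


def calibrate(verb_scores, obj, priors):
    """Re-weight per-verb scores by the object's prior and renormalise."""
    prior = priors.get(obj)
    if prior is None:
        # Unseen object: leave the scores unchanged.
        return dict(verb_scores)
    weighted = {v: s * prior[v] for v, s in verb_scores.items()}
    norm = sum(weighted.values())
    if norm <= 0:
        return dict(verb_scores)
    return {v: w / norm for v, w in weighted.items()}
```

With a prior like this, a detected "bicycle" shifts probability mass toward verbs such as "ride" and away from verbs that rarely co-occur with it, which is the kind of object-to-verb clue the abstract refers to.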

Citations: cited by 1 publication (2 citation statements)
References: 51 publications
“…Query-Based Anchors for human-object interaction (QAHOI) [11] used a framework based on a deformable DETR for detection, which was able to extract and merge feature information from different scales and identify interaction features that were overlooked by single-scale methods, thus greatly improving its detection accuracy. An Object-guided Cross-modal Calibration Network (OCN) [12] introduced additional semantic information to guide HOI detection, and proposed a verb-semantic model (VSM) to generate semantic features and incorporate both visual and semantic features into the reference for detection results.…”
Section: Related Work (mentioning)
confidence: 99%
“…Previous experimental results and related studies have shown that there is a strong correlation between detection accuracy and the amount of input feature information in HOI tasks. Some methods [11,12] used deeper backbone networks to replace the ResNet50 backbone, such as Swin-B [21] and ResNet101, which improve detection accuracy while significantly increasing the computational demand. Therefore, adding a downsampling model alone would not achieve our goal.…”
Section: Sample Model (mentioning)
confidence: 99%