Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Hangjie, Yuan,; Wang, Mang; Ni, Dong; Xu, Lijuan

doi:10.48550/arxiv.2202.00259

Cited by 1 publication

(2 citation statements)

References 51 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Query-Based Anchors for human-object interaction (QAHOI) [11] used a framework based on a deformable DETR for detection, which was able to extract and merge feature information from different scales and identify interaction features that were overlooked by single-scale methods, thus greatly improving its detection accuracy. An Object-guided Cross-modal Calibration Network (OCN) [12] introduced additional semantic information to guide HOI detection, and proposed a verb-semantic model (VSM) to generate semantic features and incorporate both visual and semantic features into the reference for detection results.…”

Section: Related Workmentioning

confidence: 99%

“…Previous experimental results and related studies have shown that there is a strong correlation between detection accuracy and the amount of input feature information in HOI tasks. Some methods [11,12] used deeper backbone networks to replace the ResNet50 backbone, such as Swin-B [21] and ResNet101, which improve detection accuracy while significantly increasing the computational demand. Therefore, adding a downsampling model alone would not achieve our goal.…”

Section: Sample Modelmentioning

confidence: 99%

See 1 more Smart Citation

Human–Object Interaction Detection with Ratio-Transformer

et al. 2022

View full text Add to dashboard Cite

Human–object interaction (HOI) is a human-centered object detection task that aims to identify the interactions between persons and objects in an image. Previous end-to-end methods have used the attention mechanism of a transformer to spontaneously identify the associations between persons and objects in an image, which effectively improved detection accuracy; however, a transformer can increase computational demands and slow down detection processes. In addition, the end-to-end method can result in asymmetry between foreground and background information. The foreground data may be significantly less than the background data, while the latter consumes more computational resources without significantly improving detection accuracy. Therefore, we proposed an input-controlled transformer, “ratio-transformer” to solve an HOI task, which could not only limit the amount of information in the input transformer by setting a sampling ratio, but also significantly reduced the computational demands while ensuring detection accuracy. The ratio-transformer consisted of a sampling module and a transformer network. The sampling module divided the input feature map into foreground versus background features. The irrelevant background features were a pooling sampler, which were then fused with the foreground features as input data for the transformer. As a result, the valid data input into the Transformer network remained constant, while irrelevant information was significantly reduced, which maintained the foreground and background information symmetry. The proposed network was able to learn the feature information of the target itself and the association features between persons and objects so it could query to obtain the complete HOI interaction triplet. The experiments on the VCOCO dataset showed that the proposed method reduced the computational demand of the transformer by 57% without any loss of accuracy, as compared to other current HOI methods.

show abstract

Section: Related Workmentioning

confidence: 99%