2023
DOI: 10.1609/aaai.v37i3.25385

End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Abstract: Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection, detecting both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language …
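The zero-shot ingredient the abstract describes can be pictured as classifying candidate human-object pairs against text embeddings of interaction phrases, so that unseen categories only require new phrases rather than new annotations. Below is a minimal, hypothetical sketch of that idea, not EoID's actual architecture; the function names, the temperature value, and the assumption that pair features live in a CLIP-aligned embedding space are all illustrative.

    import torch
    import torch.nn.functional as F

    def build_hoi_classifier(text_features: torch.Tensor) -> torch.Tensor:
        # text_features: (num_hoi_categories, d) embeddings of phrases such as
        # "a photo of a person riding a bicycle", produced by a CLIP-style text
        # encoder. Normalized rows act as a zero-shot classifier; unseen HOI
        # categories are covered simply by appending new phrases.
        return F.normalize(text_features, dim=-1)

    def score_pairs(pair_features: torch.Tensor, classifier: torch.Tensor,
                    temperature: float = 0.01) -> torch.Tensor:
        # pair_features: (num_pairs, d) visual features of candidate
        # human-object pairs, assumed (hypothetically) to live in the same
        # embedding space as the text features. Temperature-scaled cosine
        # similarity gives per-pair probabilities over all HOI categories,
        # seen and unseen alike.
        pair_features = F.normalize(pair_features, dim=-1)
        return (pair_features @ classifier.T / temperature).softmax(dim=-1)

    # Shape-only usage example with random features:
    # probs = score_pairs(torch.randn(5, 512), build_hoi_classifier(torch.randn(600, 512)))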

Citations: cited by 24 publications (52 citation statements)
References: 33 publications (74 reference statements)
“…UPT [19] applied a unary-pairwise transformer to represent each target's instance details as unary and pairwise representations. In comparison to two-stage methods, one-stage solutions [9], [20]-[26] captured context information during the early stage of feature extraction, leading to improved HOI detection performance. The success of DETR [27] has inspired many researchers to study HOI detection. QPIC [28] applied additional detection heads and relied on a bipartite graph matching algorithm to locate HOI instances and identify interactions.…”
Section: A. Human-Object Interaction Detection (mentioning)
confidence: 99%
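The bipartite matching this excerpt mentions is the Hungarian-style assignment that DETR-style detectors such as QPIC use during training to pair set predictions with ground-truth instances. Here is a minimal sketch under simplified assumptions: the cost terms (an L1 box distance and a negative class probability) and their weights are illustrative placeholders, not the exact costs from those papers.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_predictions(pred_boxes, pred_cls_probs, gt_boxes, gt_labels,
                          w_box=1.0, w_cls=1.0):
        # Build a (num_predictions x num_ground_truths) cost matrix from a
        # simple L1 localization term plus a classification term, then solve
        # the linear assignment problem so each prediction is matched to at
        # most one ground-truth instance.
        n_pred, n_gt = len(pred_boxes), len(gt_boxes)
        cost = np.zeros((n_pred, n_gt))
        for i in range(n_pred):
            for j in range(n_gt):
                box_cost = np.abs(pred_boxes[i] - gt_boxes[j]).sum()
                cls_cost = -pred_cls_probs[i, gt_labels[j]]
                cost[i, j] = w_box * box_cost + w_cls * cls_cost
        pred_idx, gt_idx = linear_sum_assignment(cost)
        return list(zip(pred_idx, gt_idx))

The key design point is that matching is global and one-to-one: unmatched predictions can be supervised as "no interaction", which removes the need for heuristic post-processing such as NMS over pairs.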
“…The success of DETR [27] has inspired many researchers to study HOI detection. QPIC [28] applied additional detection heads and relied on a bipartite graph matching algorithm to locate HOI instances and identify interactions. EOID [9] developed a teacher-student model and designed a two-stage Hungarian matching algorithm. RR-Net [26] introduced a relation-aware framework to build a progressive structure for interaction inference, which imitates the human visual mechanism of recognizing HOIs by comprehending visual instances and interactions coherently.…”
Section: A. Human-Object Interaction Detection (mentioning)
confidence: 99%
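The teacher-student design this excerpt attributes to EOID points at vision-language knowledge distillation: a frozen teacher (for example, CLIP image-text similarities over candidate pairs) supervises the student's interaction scores. EOID's exact formulation is not given in the excerpt; the following is a generic, hypothetical KL-based distillation term in that spirit.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Both tensors: (num_pairs, num_interactions). The teacher scores
        # would come from a frozen vision-language model; temperature T
        # softens both distributions, as is standard in knowledge
        # distillation, and the T*T factor keeps gradient magnitudes
        # comparable across temperature choices.
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)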
“…Recently, developments in leveraging pretrained vision and language knowledge, particularly through the large-scale Contrastive Language-Image Pretraining (CLIP) model [1], have shown promising results in a wide range of downstream tasks. These include but are not limited to image classification [2], object detection [3], and semantic segmentation [4], [5]. In the realm of text spotting, where scene text often provides rich visual and character information, the potential of the CLIP model is particularly evident.…”
Section: Introduction (mentioning)
confidence: 99%
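As a concrete instance of the transfer this excerpt describes, here is a minimal zero-shot image classification sketch with CLIP. It assumes OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git) and an image file named example.jpg; the class names and prompt template are illustrative.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    prompts = clip.tokenize(
        [f"a photo of a {c}" for c in ("dog", "cat", "bicycle")]
    ).to(device)

    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(prompts)
        # Cosine similarities between the image and each prompt,
        # softmaxed into zero-shot class probabilities.
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)

    print(probs)  # highest score on the best-matching prompt

No task-specific training is involved; swapping in new class names changes the classifier, which is exactly the property the downstream works cited above exploit.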
“…The prevailing trend of transferring multimodal knowledge in vision-language pretraining models [44,23,30,62] to a diverse range of downstream tasks has proven effective and achieved remarkable success [85,15,25,58]. A natural question arises: how can few-shot action recognition take advantage of such foundation models to mine their powerful multimodal knowledge?…”
Section: Introduction (mentioning)
confidence: 99%