2023
DOI: 10.1609/aaai.v37i3.25385

End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Abstract: Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection, detecting both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language …
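The zero-shot ingredient the abstract describes can be pictured as classifying candidate human-object pairs against text embeddings of interaction phrases, so that unseen categories only require new phrases rather than new annotations. Below is a minimal, hypothetical sketch of that idea, not EoID's actual architecture; the function names, the temperature value, and the assumption that pair features live in a CLIP-aligned embedding space are all illustrative.

    import torch
    import torch.nn.functional as F

    def build_hoi_classifier(text_features: torch.Tensor) -> torch.Tensor:
        # text_features: (num_hoi_categories, d) embeddings of phrases such as
        # "a photo of a person riding a bicycle", produced by a CLIP-style text
        # encoder. Normalized rows act as a zero-shot classifier; unseen HOI
        # categories are covered simply by appending new phrases.
        return F.normalize(text_features, dim=-1)

    def score_pairs(pair_features: torch.Tensor, classifier: torch.Tensor,
                    temperature: float = 0.01) -> torch.Tensor:
        # pair_features: (num_pairs, d) visual features of candidate
        # human-object pairs, assumed (hypothetically) to live in the same
        # embedding space as the text features. Temperature-scaled cosine
        # similarity gives per-pair probabilities over all HOI categories,
        # seen and unseen alike.
        pair_features = F.normalize(pair_features, dim=-1)
        return (pair_features @ classifier.T / temperature).softmax(dim=-1)

    # Shape-only usage example with random features:
    # probs = score_pairs(torch.randn(5, 512), build_hoi_classifier(torch.randn(600, 512)))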

Citations: cited by 24 publications (52 citation statements)
References: 33 publications (74 reference statements)
“…UPT [19] applied a unary-pairwise transformer to represent each target's instance details as unary and pairwise representations. In comparison to two-stage methods, one-stage solutions [9], [20]-[26] captured context information during the early stage of feature extraction, leading to improved HOI detection performance. The success of DETR [27] has inspired many researchers to study HOI detection. QPIC [28] applied additional detection heads and relied on a bipartite graph matching algorithm to locate HOI instances and identify interactions.…”
Section: A. Human-Object Interaction Detection (mentioning)
confidence: 99%
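The bipartite matching this excerpt mentions is the Hungarian-style assignment that DETR-style detectors such as QPIC use during training to pair set predictions with ground-truth instances. Here is a minimal sketch under simplified assumptions: the cost terms (an L1 box distance and a negative class probability) and their weights are illustrative placeholders, not the exact costs from those papers.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_predictions(pred_boxes, pred_cls_probs, gt_boxes, gt_labels,
                          w_box=1.0, w_cls=1.0):
        # Build a (num_predictions x num_ground_truths) cost matrix from a
        # simple L1 localization term plus a classification term, then solve
        # the linear assignment problem so each prediction is matched to at
        # most one ground-truth instance.
        n_pred, n_gt = len(pred_boxes), len(gt_boxes)
        cost = np.zeros((n_pred, n_gt))
        for i in range(n_pred):
            for j in range(n_gt):
                box_cost = np.abs(pred_boxes[i] - gt_boxes[j]).sum()
                cls_cost = -pred_cls_probs[i, gt_labels[j]]
                cost[i, j] = w_box * box_cost + w_cls * cls_cost
        pred_idx, gt_idx = linear_sum_assignment(cost)
        return list(zip(pred_idx, gt_idx))

The key design point is that matching is global and one-to-one: unmatched predictions can be supervised as "no interaction", which removes the need for heuristic post-processing such as NMS over pairs.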
“…The success of DETR [27] has inspired many researchers to study HOI detection. QPIC [28] applied additional detection heads and relied on a bipartite graph matching algorithm to locate HOI instances and identify interactions. EOID [9] developed a teacher-student model and designed a two-stage Hungarian matching algorithm. RR-Net [26] introduced a relation-aware framework to build a progressive structure for interaction inference, which imitates the human visual mechanism of recognizing HOIs by comprehending visual instances and interactions coherently.…”
Section: A. Human-Object Interaction Detection (mentioning)
confidence: 99%
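The teacher-student design this excerpt attributes to EOID points at vision-language knowledge distillation: a frozen teacher (for example, CLIP image-text similarities over candidate pairs) supervises the student's interaction scores. EOID's exact formulation is not given in the excerpt; the following is a generic, hypothetical KL-based distillation term in that spirit.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Both tensors: (num_pairs, num_interactions). The teacher scores
        # would come from a frozen vision-language model; temperature T
        # softens both distributions, as is standard in knowledge
        # distillation, and the T*T factor keeps gradient magnitudes
        # comparable across temperature choices.
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)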
“…Recently, developments in leveraging pretrained vision and language knowledge, particularly through the large-scale Contrastive Language-Image Pretraining (CLIP) model [1], have shown promising results in a wide range of downstream tasks. These include but are not limited to image classification [2], object detection [3], and semantic segmentation [4], [5]. In the realm of text spotting, where scene text often provides rich visual and character information, the potential of the CLIP model is particularly evident.…”
Section: Introduction (mentioning)
confidence: 99%
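As a concrete instance of the transfer this excerpt describes, here is a minimal zero-shot image classification sketch with CLIP. It assumes OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git) and an image file named example.jpg; the class names and prompt template are illustrative.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    prompts = clip.tokenize(
        [f"a photo of a {c}" for c in ("dog", "cat", "bicycle")]
    ).to(device)

    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(prompts)
        # Cosine similarities between the image and each prompt,
        # softmaxed into zero-shot class probabilities.
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)

    print(probs)  # highest score on the best-matching prompt

No task-specific training is involved; swapping in new class names changes the classifier, which is exactly the property the downstream works cited above exploit.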
“…The prevailing trend of transferring multimodal knowledge in vision-language pretraining models [44,23,30,62] to a diverse range of downstream tasks has proven effective and achieved remarkable success [85,15,25,58]. A natural question arises: how can few-shot action recognition take advantage of such foundation models to mine their powerful multimodal knowledge?…”
Section: Introduction (mentioning)
confidence: 99%