2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01896

Human-Object Interaction Detection via Disentangled Transformer

Cited by 22 publications (6 citation statements) · References 25 publications
“…Recent transformer-based works (Zou et al. 2021; Tamura, Ohashi, and Yoshinaga 2021; Kim et al. 2021; Zhou et al. 2022; Liao et al. 2022; Lim et al. 2023) leverage the encoder-decoder architecture to jointly represent human, object, and interaction features, and build their relationships implicitly. To learn representations that focus on different feature regions, Kim et al. (2023) propose three decoder branches to represent the human, the object, and the interaction, respectively.…”
Section: One-Stage HOI Detection Methods
confidence: 99%
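The three-branch decoder design this excerpt describes can be made concrete with a short sketch. The PyTorch snippet below is a hypothetical illustration (the class and attribute names are mine, not the cited authors' code): three transformer decoders with separate learned query sets all cross-attend to one shared encoder memory, so human, object, and interaction features are learned in disentangled branches.

```python
# Minimal sketch of disentangled decoding for HOI detection (assumed names,
# not the published implementation): three parallel transformer decoders
# share one encoded image memory but hold separate query embeddings.
import torch
import torch.nn as nn

class DisentangledHOIDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3, num_queries=64):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # One decoder branch per sub-task; nn.TransformerDecoder deep-copies
        # the layer, so the three branches have independent parameters.
        self.human_dec = nn.TransformerDecoder(layer, num_layers)
        self.object_dec = nn.TransformerDecoder(layer, num_layers)
        self.inter_dec = nn.TransformerDecoder(layer, num_layers)
        self.human_q = nn.Embedding(num_queries, d_model)
        self.object_q = nn.Embedding(num_queries, d_model)
        self.inter_q = nn.Embedding(num_queries, d_model)

    def forward(self, memory):  # memory: (B, HW, d_model) from a shared encoder
        b = memory.size(0)
        def tile(q):  # broadcast the learned queries across the batch
            return q.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.human_dec(tile(self.human_q), memory)   # human features
        o = self.object_dec(tile(self.object_q), memory) # object features
        i = self.inter_dec(tile(self.inter_q), memory)   # interaction features
        return h, o, i
```

In the actual papers the branches additionally exchange information and feed task-specific prediction heads; this sketch keeps only the shared-memory, separate-query structure that makes the representation "disentangled".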
“…Though CNN-based methods can effectively model human-object relationships, they usually need post-grouping strategies to form a complete triplet. Recently, to capture long-range context, transformer-based works (Tamura, Ohashi, and Yoshinaga 2021; Zou et al. 2021; Dong et al. 2021; Kim et al. 2022; Liao et al. 2022; Zhou et al. 2022; Kim, Jung, and Cho 2023) have greatly advanced HOI detection using self-attention and cross-attention mechanisms. They directly predict an HOI triplet without extracting instance-level priors or exploring their dependencies, but this end-to-end approach may suffer from insufficient exchange of contextual clues, potentially leading to a sub-optimal solution.…”
Section: Introduction
confidence: 99%
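The "direct triplet prediction" style described here can be illustrated with a minimal prediction head. This is a hypothetical sketch (names and class counts are assumptions; 80 object classes and 117 verbs follow the common HICO-DET setup): each decoder query is mapped straight to a ⟨human box, object box, verb⟩ triplet, so no post-grouping step is needed.

```python
# Hypothetical one-stage HOI head (assumed design, not any specific paper's):
# every decoder query directly yields a full <human box, object box, verb>
# triplet, the end-to-end style the excerpt contrasts with CNN post-grouping.
import torch
import torch.nn as nn

class HOITripletHead(nn.Module):
    def __init__(self, d_model=256, num_obj_classes=80, num_verbs=117):
        super().__init__()
        self.human_box = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_obj_classes + 1)  # +1 no-object
        self.verb_cls = nn.Linear(d_model, num_verbs)  # multi-label verb logits

    def forward(self, queries):                   # queries: (B, N, d_model)
        return {
            "human_boxes": self.human_box(queries).sigmoid(),
            "object_boxes": self.object_box(queries).sigmoid(),
            "object_logits": self.object_cls(queries),
            "verb_logits": self.verb_cls(queries),
        }

head = HOITripletHead()
out = head(torch.randn(2, 64, 256))
print(out["verb_logits"].shape)  # torch.Size([2, 64, 117])
```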
“…Attention mechanisms have significantly transformed the field of computer vision, particularly in tasks such as action recognition and tracking, by improving the understanding of relational dynamics [36,37]. In the specific area of human-object interaction (HOI) detection, where modeling intricate relationships is essential, attention mechanisms have proven to be exceptionally beneficial [20-24]. Key models have demonstrated that attention significantly enhances performance.…”
Section: Attention Mechanism
confidence: 99%
“…The widespread application of Transformers in the field of Natural Language Processing (NLP) has proven their excellence and convenience in handling sequential data, which has also made them popular for visual tasks [35,36,44-46]. ViT [35] addresses the high computational cost of Transformers in traditional visual tasks by flattening images into a series of pixel blocks (patches), turning image processing into a form similar to word-sequence processing in NLP.…”
Section: Vision Transformer
confidence: 99%
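The patchification step this excerpt attributes to ViT can be sketched in a few lines. This is a minimal illustration under the standard ViT-Base assumptions (224x224 input, 16x16 patches, 768-dimensional tokens), not ViT's full implementation, which also adds a class token and position embeddings.

```python
# Minimal sketch of ViT-style patch embedding: the image is cut into
# fixed-size patches, each projected to a token, so the image becomes a
# sequence analogous to a sentence of words in NLP.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv applies one kernel per patch, which is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```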