2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01896

Human-Object Interaction Detection via Disentangled Transformer

Cited by 22 publications (6 citation statements) · References 25 publications
“…Recent transformer-based works (Zou et al. 2021; Tamura, Ohashi, and Yoshinaga 2021; Kim et al. 2021; Zhou et al. 2022; Liao et al. 2022; Lim et al. 2023) leverage the encoder-decoder architecture to jointly represent human, object, and interaction features, and build their relationships implicitly. To learn representations that focus on different feature regions, Kim et al. (2023) propose three decoder branches to represent the human, the object, and the interaction, respectively.…”
Section: One-Stage HOI Detection Methods
confidence: 99%
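The three-branch decoder design this excerpt describes can be made concrete with a short sketch. The PyTorch snippet below is a hypothetical illustration (the class and attribute names are mine, not the cited authors' code): three transformer decoders with separate learned query sets all cross-attend to one shared encoder memory, so human, object, and interaction features are learned in disentangled branches.

```python
# Minimal sketch of disentangled decoding for HOI detection (assumed names,
# not the published implementation): three parallel transformer decoders
# share one encoded image memory but hold separate query embeddings.
import torch
import torch.nn as nn

class DisentangledHOIDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3, num_queries=64):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # One decoder branch per sub-task; nn.TransformerDecoder deep-copies
        # the layer, so the three branches have independent parameters.
        self.human_dec = nn.TransformerDecoder(layer, num_layers)
        self.object_dec = nn.TransformerDecoder(layer, num_layers)
        self.inter_dec = nn.TransformerDecoder(layer, num_layers)
        self.human_q = nn.Embedding(num_queries, d_model)
        self.object_q = nn.Embedding(num_queries, d_model)
        self.inter_q = nn.Embedding(num_queries, d_model)

    def forward(self, memory):  # memory: (B, HW, d_model) from a shared encoder
        b = memory.size(0)
        def tile(q):  # broadcast the learned queries across the batch
            return q.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.human_dec(tile(self.human_q), memory)   # human features
        o = self.object_dec(tile(self.object_q), memory) # object features
        i = self.inter_dec(tile(self.inter_q), memory)   # interaction features
        return h, o, i
```

In the actual papers the branches additionally exchange information and feed task-specific prediction heads; this sketch keeps only the shared-memory, separate-query structure that makes the representation "disentangled".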
“…Though CNN-based methods can effectively model human-object relationships, they usually need post-grouping strategies to form a complete triplet. Recently, to capture long-range context, transformer-based works (Tamura, Ohashi, and Yoshinaga 2021; Zou et al. 2021; Dong et al. 2021; Kim et al. 2022; Liao et al. 2022; Zhou et al. 2022; Kim, Jung, and Cho 2023) have greatly advanced HOI detection using self-attention and cross-attention mechanisms. They directly predict an HOI triplet without extracting instance-level priors or exploring their dependencies, but this end-to-end approach may suffer from insufficient exchange of contextual clues, potentially leading to a sub-optimal solution.…”
Section: Introduction
confidence: 99%
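The "direct triplet prediction" style described here can be illustrated with a minimal prediction head. This is a hypothetical sketch (names and class counts are assumptions; 80 object classes and 117 verbs follow the common HICO-DET setup): each decoder query is mapped straight to a ⟨human box, object box, verb⟩ triplet, so no post-grouping step is needed.

```python
# Hypothetical one-stage HOI head (assumed design, not any specific paper's):
# every decoder query directly yields a full <human box, object box, verb>
# triplet, the end-to-end style the excerpt contrasts with CNN post-grouping.
import torch
import torch.nn as nn

class HOITripletHead(nn.Module):
    def __init__(self, d_model=256, num_obj_classes=80, num_verbs=117):
        super().__init__()
        self.human_box = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_obj_classes + 1)  # +1 no-object
        self.verb_cls = nn.Linear(d_model, num_verbs)  # multi-label verb logits

    def forward(self, queries):                   # queries: (B, N, d_model)
        return {
            "human_boxes": self.human_box(queries).sigmoid(),
            "object_boxes": self.object_box(queries).sigmoid(),
            "object_logits": self.object_cls(queries),
            "verb_logits": self.verb_cls(queries),
        }

head = HOITripletHead()
out = head(torch.randn(2, 64, 256))
print(out["verb_logits"].shape)  # torch.Size([2, 64, 117])
```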
“…Attention mechanisms have significantly transformed the field of computer vision, particularly in tasks such as action recognition and tracking, by improving the understanding of relational dynamics [36,37]. In the specific area of human-object interaction (HOI) detection, where modeling intricate relationships is essential, attention mechanisms have proven to be exceptionally beneficial [20-24]. Key models have demonstrated that attention significantly enhances performance.…”
Section: Attention Mechanism
confidence: 99%
“…The widespread application of Transformers in the field of Natural Language Processing (NLP) has proven their excellence and convenience in handling sequential data, which has also made them popular for visual tasks [35,36,44-46]. ViT [35] addresses the high computational cost of Transformers in traditional visual tasks by flattening images into a series of pixel blocks (patches), turning image processing into a form similar to word-sequence processing in NLP.…”
Section: Vision Transformer
confidence: 99%
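The patchification step this excerpt attributes to ViT can be sketched in a few lines. This is a minimal illustration under the standard ViT-Base assumptions (224x224 input, 16x16 patches, 768-dimensional tokens), not ViT's full implementation, which also adds a class token and position embeddings.

```python
# Minimal sketch of ViT-style patch embedding: the image is cut into
# fixed-size patches, each projected to a token, so the image becomes a
# sequence analogous to a sentence of words in NLP.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv applies one kernel per patch, which is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```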