2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00579
Deep Contextual Attention for Human-Object Interaction Detection

Abstract: Human-object interaction detection is an important and relatively new class of visual relationship detection tasks, essential for deeper scene understanding. Most existing approaches decompose the problem into object localization and interaction recognition. Despite showing progress, these approaches rely only on the appearances of humans and objects and overlook the available context information, which is crucial for capturing subtle interactions between them. We propose a contextual attention framework for human-object…
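The abstract describes scoring context information against human-object pair appearance. A minimal sketch of that idea, assuming a simple dot-product attention over features of surrounding context regions (the function names, feature shapes, and fusion step here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_attention(pair_feat, context_feats):
    """Score each context region against a human-object pair feature
    and return the attention-weighted context vector.

    pair_feat:     (d,)  appearance feature of one human-object pair
    context_feats: (n, d) features of n surrounding context regions
    """
    scores = context_feats @ pair_feat   # (n,) relevance of each region
    weights = softmax(scores)            # normalize scores to sum to 1
    return weights @ context_feats       # (d,) attention-weighted context

# Illustrative usage: fuse attended context with the pair feature
# before interaction classification.
rng = np.random.default_rng(0)
pair = rng.standard_normal(8)            # pair appearance feature
ctx = rng.standard_normal((5, 8))        # 5 context region features
attended = contextual_attention(pair, ctx)
fused = np.concatenate([pair, attended])  # input to an interaction classifier
```

The design point the abstract argues for is visible in the last line: the classifier sees both pair appearance and attended context, rather than appearance alone.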

Cited by 130 publications (93 citation statements) · References 28 publications
“…The other edges are defined based on the consistencies among objects, actions and interactions. That is, if two…”

Method          | Backbone         | Score
[25]            | ResNet-50        | 48.7
BAR-CNN [21]    | Inception-ResNet | 43.6
Wang et al. [41]| ResNet-50        | 47.3
PMFNet [39]     | ResNet-50        | 52.0
VSGNet [37]     | ResNet-152       | 51.8

Section: Semantic Embedding Network
confidence: 99%
“…Most existing works on HOI detection [9,11,14,25,36,39,41] treat HOIs as individual interaction categories and focus on mining visual representations of human-object pairs to improve classification performances. Despite previous successes, these conventional…”

(Figure 2: Polysemy of action labels.)

Section: Introduction
confidence: 99%
“…Though impressive solutions have been devised using deep models for human action recognition, it is yet a challenging task to discriminate various fine-grained activities like playing a violin vs a guitar, using a phone vs talking to passengers while driving, etc. It can be regarded as a more challenging problem when multiple actions appear in a single image, such as walking and talking over the phone [8].…”
Section: Introduction
confidence: 99%
“…Most previous methods either rely on employing handcrafted visual features [13,24,47,77], such as color and shape, or using mid-level holistic image representations [11,33,44,75,76] constructed by encoding hand-crafted visual features. Recently, deep Convolutional Neural Networks (CNNs) have revolutionized computer vision by significantly advancing the state-of-the-art in many areas such as, image classification [8,20,31,38,43,64], object detection/segmentation [19,28,30,40,53,57,68,69,81] and action recognition [26,48,51,63]. Similarly, deep learning techniques have also made an impact on satellite image analysis, including aerial scene classification [2,32,54,73] and hyperspectral image analysis [14,15,66].…”
Section: Introduction
confidence: 99%