Scaling Human-Object Interaction Recognition Through Zero-Shot Learning

Shen, Liyue; Yeung, Serena; Hoffman, Judy; Mori, Greg; Li, Feifei

doi:10.1109/wacv.2018.00181

Cited by 146 publications

(130 citation statements)

References 17 publications

Supporting

Mentioning

130

Contrasting


“…However, with the introduction of datasets with a larger vocabulary of objects and predicates [6,23], visual phrase approaches have been facing severe difficulties as most relations have very few training examples. Compositional methods [9,11,17,27,30,33,42], which allow sharing knowledge across triplets, have scaled better but do not cope well with unseen relations. To increase the expressiveness of the generic compositional detectors, recent works have developed models of statistical dependencies between the subject, object and predicate, using, for example, graphical models [7,24], language distillation [45], or semantic context [48].…”

Section: Related Work (mentioning)

confidence: 99%

Detecting Unseen Visual Relations Using Analogies

Peyre, Šivic, Laptev et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


We seek to detect visual relations in images in the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training. This is an important set-up due to the combinatorial nature of visual relations: collecting sufficient training data for all possible triplets would be very hard. The contributions of this work are three-fold. First, we learn a representation of visual relations that combines (i) individual embeddings for subject, object and predicate together with (ii) a visual phrase embedding that represents the relation triplet. Second, we learn how to transfer visual phrase embeddings from existing training triplets to unseen test triplets using analogies between relations that involve similar objects. Third, we demonstrate the benefits of our approach on three challenging datasets: on HICO-DET, our model achieves significant improvement over a strong baseline for both frequent and unseen triplets, and we observe similar improvement for the retrieval of unseen triplets with out-of-vocabulary predicates on the COCO-a dataset as well as the challenging unusual triplets in the UnRel dataset.


“…Chao et al [17] set the benchmark in HICO-DET based on a three-stream detection framework, exploiting the visual and spatial representations of human, object and the pairwise bounding box. Shen et al [32] analyzed the zero-shot problem with separate verb and object detection losses. Zhuang et al [23] addressed the long-tail issue with supervision from web data.…”

Section: Related Work (mentioning)

confidence: 99%

Interact as You Intend: Intention-Driven Human-Object Interaction Detection

Wong et al. 2020

IEEE Trans. Multimedia


Recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes. However, the ability to fully comprehend a social scene is still in its preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in social scene images, which is demanding in terms of research and increasingly useful for practical applications. To undertake social tasks interacting with objects, humans direct their attention and move their body based on their intention. Based on this observation, we provide a unique computational perspective to explore human intention in HOI detection. Specifically, the proposed human intention-driven HOI detection (iHOI) framework models human pose with the relative distances from body joints to the object instances. It also utilizes human gaze to guide the attended contextual regions in a weakly-supervised setting. In addition, we propose a hard negative sampling strategy to address the problem of misgrouping. We perform extensive experiments on two benchmark datasets, namely V-COCO and HICO-DET, and show that iHOI outperforms the existing approaches. The efficacy of each proposed component has also been validated.
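
The abstract above mentions modelling human pose through the relative distances from body joints to object instances, but gives no further detail here. A minimal sketch of one such feature follows, assuming 2D keypoints and axis-aligned boxes. The function name `joint_to_object_distances`, the choice of the object box center as the target, and the normalization by the human box diagonal are all illustrative assumptions, not the formulation used by iHOI.

```python
import math


def joint_to_object_distances(joints, object_box, human_box):
    """Return one normalized distance per body joint to the object box center.

    joints:     list of (x, y) keypoint coordinates (hypothetical input format)
    object_box: (x1, y1, x2, y2) box of the object instance
    human_box:  (x1, y1, x2, y2) box of the person, used only for scaling
    """
    # Center of the object box (assumed target point for every joint).
    ox = (object_box[0] + object_box[2]) / 2.0
    oy = (object_box[1] + object_box[3]) / 2.0

    # Assumption: divide by the human box diagonal so the feature does not
    # depend on how large the person appears in the image.
    scale = math.hypot(human_box[2] - human_box[0], human_box[3] - human_box[1])
    if scale == 0:
        raise ValueError("human_box has zero size")

    return [math.hypot(x - ox, y - oy) / scale for x, y in joints]


# Usage: two joints, an object centered at (3, 4), a 6x8 human box.
features = joint_to_object_distances(
    joints=[(0.0, 0.0), (3.0, 4.0)],
    object_box=(2.0, 3.0, 4.0, 5.0),
    human_box=(0.0, 0.0, 6.0, 8.0),
)
```

A vector like `features` could then be concatenated with appearance features before classification; whether iHOI does so in this way is not stated in the abstract.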


“…Desai and Ramanan (2012) propose a compositional model that uses human pose and interacting objects to predict human actions, but the visual phraselets and tree structure they use are too simple to capture sophisticated HOI relations in large datasets. In connection with neural networks, Shen et al (2018) ...…”

Section: Combination Of Action Recognition and Pose Estimation (mentioning)

confidence: 99%

“…The proposed method achieves the state-of-the-art results on two public benchmarks including V-COCO and HICO-DET use spatial relations between human and object positions to recognize HOIs. Shen et al (2018) focus on the difficulty of obtaining all the possible HOI samples in reality, and propose a zero-shot learning method to tackle with the lack of data problem.…”

Section: Introduction (mentioning)

confidence: 99%

Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

Feng, Liu et al. 2019

AAAI


Human-object interaction (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human actions and the locations of the interacted objects provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e., a pose-aware HOI recognition module and an HOI-guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, which can be trained in an end-to-end manner. The proposed method achieves state-of-the-art performance on two public benchmarks, the Verbs in COCO (V-COCO) and HICO-DET datasets.
