Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

Feng, Wei; Liu, Wentao; Li, Tong; Peng, Jing; Qian, Chen; Hu, Xiaolin

doi:10.1609/aaai.v33i01.3301898

Cited by 9 publications

(5 citation statements)

References 6 publications


“…In addition to the cropped instance features, previous methods leverage combined spatial features [3,12,10,14,9,16,46], union box features [34,39], or context features [10,40,30] to improve the accuracy of HOI detection. In order to concentrate on more interaction-relevant features, some methods utilize extra features, such as human pose [37,5,24,14], human parts [47,39,23] and language features [42,9,30,21]. However, the serial architectures of such two-stage methods impair the efficiency of HOI detection.…”

Section: Two-stage Methods (mentioning)

confidence: 99%

“…Determining which regions to concentrate on is critical and challenging for HOI detectors. To obtain essential features for interaction prediction, conventional two-stage methods usually involve extra features, e.g., human pose [37,5,24,14] and language [42,9,30,21]. However, even with extra features, two-stage methods still focus on the detected instances that might be inaccurate, which are less adaptive and limited by the detected instances.…”

Section: Introduction (mentioning)

confidence: 99%


Reformulating HOI Detection as Adaptive Set Prediction

Ming-fei, Liao, Liu et al. 2021

Preprint

Self Cite


Determining which image regions to concentrate on is critical for Human-Object Interaction (HOI) detection. Conventional HOI detectors focus on either detected human and object pairs or pre-defined interaction locations, which limits the learning of effective features. In this paper, we reformulate HOI detection as an adaptive set prediction problem. With this novel formulation, we propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches. To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer. Each query adaptively aggregates the interaction-relevant features from global contexts through multi-head co-attention. Besides, the training process is supervised adaptively by matching each ground truth with an interaction prediction. Furthermore, we design an effective instance-aware attention module to introduce instructive features from the instance branch into the interaction branch. Our method outperforms previous state-of-the-art methods without any extra human pose and language features on three challenging HOI detection datasets. In particular, we achieve over 31% relative improvement on the large-scale HICO-DET dataset. Code is available at https://github.com/yoyomimi/AS-Net.
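The abstract above says that training is supervised by matching each ground truth with an interaction prediction, but it does not say how the matching is computed. The sketch below is only an illustration of one-to-one set matching under assumptions: the cost matrix, the function name `match_ground_truth`, and the brute-force search over assignments are all hypothetical and are not taken from the AS-Net paper or code.

```python
from itertools import permutations


def match_ground_truth(cost):
    """Find the one-to-one assignment of ground truths to predictions
    with the lowest total cost.

    cost[g][p] is a hypothetical matching cost between ground truth g
    and prediction p. Assumes at least as many predictions as ground
    truths; otherwise no assignment exists and (None, inf) is returned.
    Brute force is only practical for very small sets.
    """
    n_gt = len(cost)
    n_pred = len(cost[0]) if cost else 0
    best, best_total = None, float("inf")
    # Each permutation picks a distinct prediction for every ground truth.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[g][p] for g, p in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return best, best_total


# Two ground truths, three predictions (made-up costs).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
]
assignment, total = match_ground_truth(cost)
print(assignment)  # index of the prediction matched to each ground truth
```

Predictions left unmatched (index 2 in this example) would typically be treated as background during training; a polynomial-time assignment solver would replace the brute-force loop at realistic set sizes.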


“…GPNN uses a message passing mechanism to reason upon graph structured information [23]. Feng et al. proposed a turbo learning method which views human pose and HOI as complementary information to each other and optimizes both tasks in an iterative manner [24]. Our proposed HRS explores the geometric relations and action relations between humans and entities.…”

Section: Vrd and Hoi-det (mentioning)

confidence: 99%

Human-centric Relation Segmentation: Dataset and Solution

Liu, Wang, Gao et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.


Vision and language understanding techniques have achieved remarkable progress, but it is still difficult to handle problems involving very fine-grained details. For example, when the robot is told to "bring me the book in the girl's left hand", most existing methods would fail if the girl holds a book in each hand. In this work, we introduce a new task named human-centric relation segmentation (HRS), as a fine-grained case of HOI-det. HRS aims to predict the relations between the human and surrounding entities and identify the relation-correlated human parts, which are represented as pixel-level masks. For the above example, our HRS task produces results in the form of relation triplets <girl [left hand], hold, book> and exact segmentation masks of the book, with which the robot can easily accomplish the grabbing task. Correspondingly, we collect a new Person In Context (PIC) dataset for this new task, which contains 17,122 high-resolution images with densely annotated entity segmentation and relations, including 141 object categories, 23 relation categories and 25 semantic human parts. We also propose a Simultaneous Matching and Segmentation (SMS) framework as a solution to the HRS task. It contains three parallel branches for entity segmentation, subject-object matching and human parsing, respectively. Specifically, the entity segmentation branch obtains entity masks by dynamically generated conditional convolutions; the subject-object matching branch detects the existence of any relations, links the corresponding subjects and objects by displacement estimation and classifies the interacted human parts; and the human parsing branch generates the pixelwise human part labels. Outputs of the three branches are fused to produce the final HRS results. Extensive experiments on the PIC and V-COCO datasets show that the proposed SMS method outperforms baselines at an inference speed of 36 FPS. Notably, SMS outperforms the best performing baseline, m-KERN, with only 17.6% of its time cost. The dataset and code will be released at http://picdataset.com/challenge/index/.


“…Current works pay more attention to exploring how to improve the second stage. The most recent works aim to understand HOI by capturing context information [6,26] or human structural message [25,5,4,32]. Some works [21,27,32] formulate the second stage as a graph reasoning problem and use graph convolutional networks to predict the HOI.…”

Section: Related Work (mentioning)

confidence: 99%

PPDM: Parallel Point Detection and Matching for Real-time Human-Object Interaction Detection

Liao, Liu, Wang et al. 2019

Preprint

Self Cite


We propose a single-stage Human-Object Interaction (HOI) detection method that has outperformed all existing methods on the HICO-DET dataset at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposal generation and proposal classification. Their effectiveness and efficiency are limited by the sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>. Human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originating from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. The isolated detection boxes that are unlikely to form a meaningful HOI triplet are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is only applied around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets.
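The point-triplet idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the PPDM implementation: the function names, the plain-tuple point representation, and the nearest-center rule used to resolve a displaced interaction point to a detected human or object are assumptions made for the example.

```python
import math


def interaction_point(human_center, object_center):
    """Midpoint of the human and object box centers."""
    return ((human_center[0] + object_center[0]) / 2.0,
            (human_center[1] + object_center[1]) / 2.0)


def nearest(target, candidates):
    """Index of the candidate point closest to target (Euclidean)."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(target, candidates[i]))


def match_pairs(interaction_points, human_disps, object_disps,
                human_centers, object_centers):
    """For each interaction point, follow its two predicted displacements
    and pair the detected human and object centers they land closest to.
    Nearest-center resolution is an assumption of this sketch."""
    pairs = []
    for (ix, iy), (hdx, hdy), (odx, ody) in zip(
            interaction_points, human_disps, object_disps):
        h = nearest((ix + hdx, iy + hdy), human_centers)
        o = nearest((ix + odx, iy + ody), object_centers)
        pairs.append((h, o))
    return pairs


# Made-up detections: two humans and two objects.
humans = [(10.0, 10.0), (50.0, 50.0)]
objects = [(20.0, 10.0), (60.0, 40.0)]
points = [interaction_point(humans[0], objects[0]),
          interaction_point(humans[1], objects[1])]
# Slightly noisy displacements from each interaction point.
h_disps = [(-4.0, 1.0), (-5.0, 5.0)]
o_disps = [(5.0, 0.0), (5.0, -5.0)]
print(match_pairs(points, h_disps, o_disps, humans, objects))
```

Because matching starts from interaction points rather than from every human-object pair, detections that no interaction point resolves to never appear in a pair, which is consistent with the suppression of isolated boxes described in the abstract.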
