Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition

2009
DOI: 10.1109/tpami.2009.83

Abstract: Interpretation of images and videos containing humans interacting with different objects is a daunting task. It involves understanding scene/event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, recognition rate improves when interactions between them are considered. Motivated by psychological studies of human perception, we present a Bayesian approach which integra…

Cited by 486 publications (409 citation statements). References 44 publications.
“…Visual Genome is the first large-scale visual relationship dataset. This dataset can be used to study the extraction of visual relationships (Sadeghi et al 2015) from images, and its interactions between objects can also be used to study action recognition (Yao and Fei-Fei 2010; Ramanathan et al 2015) and spatial orientation between objects (Gupta et al 2009; Prest et al 2012).…”
Section: Relationship Extraction
confidence: 99%
“…Such strategies benefit from using the global image content, thus not suffering from low-quality appearance, small objects, or occlusions. The object-action context is addressed in [17, 23, 11, 38, 15], while spatial coherence constraints may be enforced as well [11].…”
Section: Related Work
confidence: 99%
“…Despite many successes achieved by these methods, we argue that invariant feature sets are insufficient alone for this complicated task, since most of them can only provide partial invariance — some address this type of variations and others address that, but not all; and even with these feature sets, many prototypes are still needed to cover the huge range of variability exhibited in the pose space of the human body, not to mention that such a representation is usually high-dimensional. To deal with these issues, some authors proposed to enhance the stability of feature sets using various context information (if available), such as human-object context [13][14][15] or group context [16, 11, 1], or using a multiple-cues-based approach to combine the strengths of different features [2]. Recently, Wang et al introduce a method which relies on more semantically meaningful features (i.e., poselets) and arranges them in a hierarchical manner to improve the invariance and discriminative power of the feature representation [3], achieving state-of-the-art performance on a challenging web data set of still images [17].…”
Section: Related Work
confidence: 99%