We present a novel approach for estimating body part appearance models for pictorial structures. We learn latent relationships between the appearance of different body parts from annotated images, which then help in estimating better appearance models on novel images. The learned appearance models are general, in that they can be plugged into any pictorial structure engine. In a comprehensive evaluation we demonstrate the benefits brought by the new appearance models to an existing articulated human pose estimation algorithm, on hundreds of highly challenging images from the TV series Buffy the vampire slayer and the PASCAL VOC 2008 challenge.
We present a technique for estimating the spatial layout of humans in still images-the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not side). We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films. We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.
Abstract. We present a novel multi-person pose estimation framework, which extends pictorial structures (PS) to explicitly model interactions between people and to estimate their poses jointly. Interactions are modeled as occlusions between people. First, we propose an occlusion probability predictor, based on the location of persons automatically detected in the image, and incorporate the predictions as occlusion priors into our multi-person PS model. Moreover, our model includes an inter-people exclusion penalty, preventing body parts from different people from occupying the same image region. Thanks to these elements, our model has a global view of the scene, resulting in better pose estimates in group photos, where several persons stand nearby and occlude each other. In a comprehensive evaluation on a new, challenging group photo datasets we demonstrate the benefits of our multi-person model over a state-of-the-art single-person pose estimator which treats each person independently.
The objective of this work is to determine if people are interacting in TV video by detecting whether they are looking at each other or not. We determine both the temporal period of the interaction and also spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian Process regression; (iii) propose and evaluate several methods for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) introduce new ground truth annotation for this task, extending the TV Human Interactions Dataset [28]. The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.
Generalised wide are search and surveillance is a common-place tasking for multi-sensory equipped autonomous systems. Here we present on a key supporting topic to this task -the automatic interpretation, fusion and detected target reporting from multi-modal sensor information received from multiple autonomous platforms deployed for wide-area environment search. We detail the realization of a real-time methodology for the automated detection of people and vehicles using combined visible-band (EO), thermal-band (IR) and radar sensing from a deployed network of multiple autonomous platforms (ground and aerial). This facilities real-time target detection, reported with varying levels of confidence, using information from both multiple sensors and multiple sensor platforms to provide environment-wide situational awareness. A range of automatic classification approaches are proposed, driven by underlying machine learning techniques, that facilitate the automatic detection of either target type with cross-modal target confirmation. Extended results are presented that show both the detection of people and vehicles under varying conditions in both isolated rural and cluttered urban environments with minimal false positive detection. Performance evaluation is presented at an episodic level with individual classifiers optimized for maximal each object of interest (vehicle/person) detection over a given search path/pattern of the environment, across all sensors and modalities, rather than on a per sensor sample basis. Episodic target detection, evaluated over a number of wide-area environment search and reporting tasks, generally exceeds 90%+ for the targets considered here.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.