We present an approach to real-time person tracking in crowded and/or unknown environments using multi-modal integration. We combine stereo, color, and face detection modules into a single robust system, and show an initial application in an interactive, face-responsive display. Dense, real-time stereo processing is used to isolate users from other objects and people in the background. Skin-hue classification identifies and tracks likely body parts within the silhouette of a user. Face pattern detection discriminates and localizes the face within the identified body parts. Faces and bodies of users are tracked over several temporal scales: short-term (user stays within the field of view), medium-term (user exits and reenters within minutes), and long-term (user returns after hours or days). Short-term tracking is performed using simple region position and size correspondences, while medium- and long-term tracking are based on statistics of user appearance. We discuss the failure modes of each individual module, describe our integration method, and report results with the complete system in trials with thousands of users.
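As a rough illustration of short-term tracking by region position and size correspondence, the sketch below greedily pairs tracked regions with new detections. The `Region` type, the `match_cost` formula, and the `max_cost` threshold are assumptions made for this example; the abstract does not specify how correspondences are scored.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical image region: centre (x, y) and size (w, h) in pixels."""
    x: float
    y: float
    w: float
    h: float


def match_cost(prev: Region, cur: Region) -> float:
    """Lower is better: centre shift relative to region size, plus change in log-area."""
    shift = math.hypot(cur.x - prev.x, cur.y - prev.y) / max(prev.w, prev.h)
    area_change = abs(math.log((cur.w * cur.h) / (prev.w * prev.h)))
    return shift + area_change


def associate(tracks: dict, detections: list, max_cost: float = 1.0) -> dict:
    """Greedily pair each track id with at most one detection index."""
    candidates = sorted(
        (match_cost(region, det), track_id, det_idx)
        for track_id, region in tracks.items()
        for det_idx, det in enumerate(detections)
    )
    assigned, used = {}, set()
    for cost, track_id, det_idx in candidates:
        if cost > max_cost:
            break  # candidates are sorted, so the rest are worse
        if track_id in assigned or det_idx in used:
            continue
        assigned[track_id] = det_idx
        used.add(det_idx)
    return assigned


# Two tracked users from the previous frame, three regions in the current frame.
tracks = {0: Region(100, 100, 40, 80), 1: Region(300, 120, 40, 80)}
detections = [Region(305, 122, 42, 82), Region(102, 101, 40, 80), Region(600, 50, 10, 10)]
matches = associate(tracks, detections)
```

Here `matches` maps each track id to the index of its detection. The third detection stays unmatched and would be a candidate for the appearance-based medium- or long-term matching described above.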
Copyright 2002 Springer-Verlag. Published in the 7th European Conference on Computer Vision (ECCV-2002), May 28-31, 2002, Copenhagen, Denmark. Time-Adaptive, Per-Pixel Mixtures Of Gaussians (TAPPMOGs) have recently become a popular choice for robust modeling and removal of complex and changing backgrounds at the pixel level. However, TAPPMOG-based methods cannot easily be made to model dynamic backgrounds with highly complex appearance, or to adapt promptly to sudden "uninteresting" scene changes such as the re-positioning of a static object or the turning on of a light, without further undermining their ability to segment foreground objects, such as people, where they occlude the background for too long. To alleviate tradeoffs such as these, and, more broadly, to allow TAPPMOG segmentation results to be tailored to the specific needs of an application, we introduce a general framework for guiding pixel-level TAPPMOG evolution with feedback from "high-level" modules. Each such module can use pixel-wise maps of positive and negative feedback to attempt to impress upon the TAPPMOG some definition of foreground that is best expressed through "higher-level" primitives such as image region properties or semantics of objects and events. By pooling the foreground error corrections of many high-level modules into a shared, pixel-level TAPPMOG model in this way, we improve the quality of the foreground segmentation and the performance of all modules that make use of it. We show an example of using this framework with a TAPPMOG method and high-level modules that all rely on dense depth data from a stereo camera.
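One way to picture the pooling of pixel-wise feedback is sketched below. This is not the paper's mechanism. It assumes, purely for illustration, that positive feedback marks pixels a module believes are true foreground, that negative feedback marks pixels it believes were wrongly labeled foreground, and that the pooled map scales a per-pixel learning rate. The function names and the exponential `gain` rule are hypothetical.

```python
import math


def pool_feedback(shape, positive_maps, negative_maps):
    """Add positive maps and subtract negative maps, pixel by pixel."""
    rows, cols = shape
    pooled = [[0.0] * cols for _ in range(rows)]
    for sign, maps in ((1.0, positive_maps), (-1.0, negative_maps)):
        for fmap in maps:
            for r in range(rows):
                for c in range(cols):
                    pooled[r][c] += sign * fmap[r][c]
    return pooled


def learning_rate_map(base_alpha, pooled, gain=1.0):
    """Assumed rule: positive feedback slows adaptation, negative feedback speeds it up."""
    return [
        [min(1.0, base_alpha * math.exp(-gain * value)) for value in row]
        for row in pooled
    ]


# One hypothetical module votes "foreground" at pixel (0, 1);
# another flags pixel (1, 0) as a false foreground detection.
positive = [[[0, 1], [0, 0]]]
negative = [[[0, 0], [1, 0]]]
pooled = pool_feedback((2, 2), positive, negative)
rates = learning_rate_map(0.05, pooled)
```

Under these assumptions, the pixel with positive feedback is absorbed into the background more slowly than an untouched pixel, and the pixel with negative feedback more quickly.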
Segmentation of novel or dynamic objects in a scene, often referred to as "background subtraction" or "foreground segmentation", is a critical early step in most computer vision applications in domains such as surveillance and human-computer interaction. All previously described real-time methods fail to handle properly one or more common phenomena, such as global illumination changes, shadows, inter-reflections, similarity of foreground color to background, and non-static backgrounds (e.g., active video displays or trees waving in the wind). The recent advent of hardware and software for real-time computation of depth imagery makes better approaches possible. We propose a method for modeling the background that uses per-pixel, time-adaptive, Gaussian mixtures in the combined input space of depth and luminance-invariant color. This combination in itself is novel, but we further improve it by introducing the ideas of 1) modulating the background model learning rate based on scene activity, and 2) making color-based segmentation criteria dependent on depth observations. Our experiments show that the method possesses much greater robustness to problematic phenomena than the prior state-of-the-art, without sacrificing real-time performance, making it well-suited for a wide range of practical applications in video event detection and recognition.
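The sketch below shows a generic per-pixel, time-adaptive Gaussian mixture with an activity-dependent learning rate. It is a simplified illustration, not the authors' implementation. It assumes a feature vector of depth plus two color channels scaled to comparable ranges, diagonal covariances, and a hypothetical rule `alpha / (1 + activity)`. It covers only idea 1 above; the depth-dependent color criteria of idea 2 are not modeled.

```python
class PixelMixture:
    """Time-adaptive mixture of diagonal Gaussians for a single pixel (illustrative)."""

    def __init__(self, k=3, alpha=0.05, init_var=25.0, min_var=1.0,
                 match_sigmas=2.5, bg_weight=0.7):
        self.k, self.alpha = k, alpha
        self.init_var, self.min_var = init_var, min_var
        self.match_sigmas, self.bg_weight = match_sigmas, bg_weight
        self.components = []  # each entry: [weight, mean, var]

    def _matches(self, comp, x):
        _, mean, var = comp
        limit = self.match_sigmas ** 2
        return all((xi - mi) ** 2 <= limit * vi for xi, mi, vi in zip(x, mean, var))

    def _background(self):
        # The heaviest components count as background until they cover bg_weight.
        ranked = sorted(self.components, key=lambda comp: comp[0], reverse=True)
        chosen, total = [], 0.0
        for comp in ranked:
            chosen.append(comp)
            total += comp[0]
            if total >= self.bg_weight:
                break
        return chosen

    def update(self, x, activity=0.0):
        """Classify x (True means foreground), then adapt the model to it."""
        alpha = self.alpha / (1.0 + activity)  # assumed rule: more activity, slower learning
        # First matching component wins; a fuller version would pick the best match.
        matched = next((c for c in self.components if self._matches(c, x)), None)
        is_foreground = matched is None or not any(c is matched for c in self._background())
        for comp in self.components:
            comp[0] = (1.0 - alpha) * comp[0] + (alpha if comp is matched else 0.0)
        if matched is not None:
            _, mean, var = matched
            for i, xi in enumerate(x):
                diff = xi - mean[i]
                mean[i] += alpha * diff
                var[i] = max(self.min_var, (1.0 - alpha) * var[i] + alpha * diff * diff)
        else:
            new = [alpha, list(x), [self.init_var] * len(x)]
            if len(self.components) < self.k:
                self.components.append(new)
            else:
                weakest = min(range(self.k), key=lambda i: self.components[i][0])
                self.components[weakest] = new
        total = sum(comp[0] for comp in self.components)
        for comp in self.components:
            comp[0] /= total
        return is_foreground


def frames_until_absorbed(activity):
    """Count frames a newly arrived object stays foreground at a given activity level."""
    model = PixelMixture()
    background, newcomer = (200.0, 120.0, 130.0), (80.0, 60.0, 200.0)
    for _ in range(50):
        model.update(background, activity)
    frames = 0
    while model.update(newcomer, activity):
        frames += 1
    return frames


fast = frames_until_absorbed(0.0)
slow = frames_until_absorbed(4.0)
```

With these settings, `slow` exceeds `fast`: lowering the learning rate when the scene is active keeps a person who lingers in view segmented as foreground for longer.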