A framework is presented for estimating the pose of a camera from images extracted from a single omnidirectional image of an urban scene, given only a 2D map of building outlines that contains neither 3D geometric information nor appearance data. The framework identifies vertical corner edges of buildings in the query image, which we term VCLH, together with the normals of the adjacent building planes, through vanishing point analysis. A bottom-up process then groups the VCLH into elemental planes and subsequently into 3D structural fragments, recovered up to a similarity transformation. A geometric hashing lookup allows us to rapidly establish multiple candidate correspondences between the structural fragments and the building contours in the 2D map. A voting-based camera pose estimation method is then employed to recover the correspondences that admit a camera pose solution with high consensus. On a dataset that is challenging even for humans, the system ranked the correct match within the top 30 of 3600 camera pose hypotheses (0.83% selectivity) for 50.9% of queries.
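To illustrate the consensus step only (this is a minimal sketch, not the authors' implementation), candidate camera poses can be ranked by how many structural fragments agree with each hypothesis. The `consistency_fn` error measure and `inlier_thresh` below are assumptions introduced purely for the example.

```python
import numpy as np

def rank_pose_hypotheses(fragments, map_segments, pose_hypotheses,
                         consistency_fn, inlier_thresh=1.0):
    """Hypothetical consensus-based ranking of camera pose hypotheses.

    Each hypothesis (e.g. produced by a geometric-hashing lookup between a
    structural fragment and the 2D map) is scored by how many fragments agree
    with it within `inlier_thresh`; hypotheses are returned best-first.
    """
    scores = []
    for pose in pose_hypotheses:
        votes = 0
        for frag in fragments:
            # consistency_fn is an assumed helper that measures the alignment
            # error of a fragment against the map under this candidate pose.
            err = consistency_fn(pose, frag, map_segments)
            if err < inlier_thresh:
                votes += 1
        scores.append(votes)
    order = np.argsort(scores)[::-1]  # highest consensus first
    return [(pose_hypotheses[i], scores[i]) for i in order]
```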
Visual saliency is a computational process that identifies important locations and structure in the visual field. Most current saliency methods rely on cues such as color and texture while ignoring depth information, which is known to be an important saliency cue in the human cognitive system. We propose a novel computational model of visual saliency that incorporates depth information. We compare our approach to several state-of-the-art visual saliency methods and introduce a method for saliency-based segmentation of generic objects. We demonstrate that by explicitly constructing 3D layout and shape features from depth measurements, we obtain better performance than methods that treat the depth map as just another image channel. Our method requires no learning and can operate on scenes of which the system has no prior knowledge. We conduct object segmentation experiments on a new dataset of registered RGB-D images captured on a mobile-manipulator robot.
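To make the role of depth as a saliency cue concrete, here is a hedged, minimal sketch (not the paper's model) that fuses a global color-contrast cue with a crude depth-layout cue; the specific cues and the blending weight `w_depth` are assumptions for illustration only.

```python
import numpy as np

def depth_aware_saliency(rgb, depth, w_depth=0.5):
    """Illustrative fusion of a color-contrast cue with a depth-derived cue.

    `rgb` is an HxWx3 float array in [0, 1]; `depth` is an HxW array in meters
    (zeros mark invalid readings).  Returns a saliency map in [0, 1].
    """
    # Global color-contrast cue: distance of each pixel from the mean color.
    mean_color = rgb.reshape(-1, 3).mean(axis=0)
    color_sal = np.linalg.norm(rgb - mean_color, axis=2)

    # Depth layout cue: deviation from the median scene depth, a crude
    # stand-in for explicit 3D layout/shape features.
    valid = depth > 0
    med = np.median(depth[valid]) if valid.any() else 0.0
    depth_sal = np.abs(depth - med)
    depth_sal[~valid] = 0.0

    # Normalize each cue to [0, 1] and blend.
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return (1 - w_depth) * norm(color_sal) + w_depth * norm(depth_sal)
```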
We present a novel action representation based on encoding the global temporal movement of an action. We represent an action as a set of movement pattern histograms that encode its global temporal dynamics. Our key observation is that the temporal dynamics of an action are robust to variations in appearance and viewpoint, making them useful for action recognition and retrieval. We pose the problem of computing the similarity between two action representations as a maximum matching problem in a bipartite graph. We demonstrate the effectiveness of our method for cross-view action recognition on the IXMAS dataset. We also show how our representation complements existing bag-of-features representations on the UCF50 dataset. Finally, we show the power of our representation for action retrieval on a new real-world dataset containing repetitive motor movements exhibited by children with autism in an unconstrained classroom setting.
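The bipartite-matching step can be sketched briefly. Assuming each action is an array whose rows are movement pattern histograms, a maximum-weight matching solved with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) yields a similarity score; the histogram-intersection kernel and the normalization used here are assumptions for this example, not necessarily the paper's choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def action_similarity(hists_a, hists_b):
    """Similarity between two actions, each a set of movement pattern
    histograms (rows assumed L1-normalized), via maximum-weight bipartite
    matching."""
    # Pairwise histogram similarity (histogram intersection, one common choice).
    sim = np.array([[np.minimum(ha, hb).sum() for hb in hists_b]
                    for ha in hists_a])
    # linear_sum_assignment minimizes cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-sim)
    return sim[rows, cols].sum() / max(len(hists_a), len(hists_b))

# Toy usage with random histograms.
rng = np.random.default_rng(0)
a = rng.random((5, 16)); a /= a.sum(axis=1, keepdims=True)
b = rng.random((5, 16)); b /= b.sum(axis=1, keepdims=True)
print(action_similarity(a, b))
```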