Abstract: We describe a system to detect objects in three-dimensional space using video and inertial sensors (accelerometer and gyrometer), which are ubiquitous in modern mobile platforms from phones to drones. Inertial measurements make it possible to impose class-specific scale priors on objects and provide a global orientation reference. A minimal sufficient representation, the posterior of the semantic (identity) and syntactic (pose) attributes of objects in space, can be decomposed into a geometric term, which can be maintained by a loca…
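One schematic way to read the decomposition sketched in this abstract (the notation below is ours, not the paper's) is as a factorization of the joint posterior over object identity c and pose g, where the inertial data informs the geometric factor through gravity alignment and metric scale:

```latex
p(\underbrace{c}_{\text{identity}},\,\underbrace{g}_{\text{pose}} \mid \text{images},\,\text{inertials})
\;\propto\;
\underbrace{p(\text{images} \mid c, g)}_{\text{appearance (semantics)}}\;
\underbrace{p(g \mid \text{inertials})}_{\text{geometry: scale, gravity}}\;
\underbrace{p(c)}_{\text{class prior}}
```

Under this reading, the class-specific scale priors mentioned in the abstract enter through the geometric factor, since the inertials fix metric scale.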
“…And in [10], the authors address the localization task from only object observations in a prior semantic map by computing a matrix permanent. The second is SLAM-aided object detection [11,12] and reconstruction [13,14]: [11] develops a 2D object recognition system that is robust to viewpoint changes with the assistance of camera localization, while [12] performs confidence-growing 3D object detection using visual-inertial measurements. [13] and [14] reconstruct the dense surfaces of 3D objects by fusing point clouds from monocular and RGB-D SLAM, respectively.…”
We propose a stereo vision-based approach for tracking camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box with end-to-end approaches, we use easy-to-label 2D detections and discrete viewpoint classification together with a lightweight semantic inference method to obtain rough 3D object measurements. Building on object-aware camera pose tracking, which is robust in dynamic environments, and on our novel dynamic object bundle adjustment (BA) approach that fuses temporal sparse feature correspondences with the semantic 3D measurement model, we estimate 3D object pose, velocity, and an anchored dynamic point cloud with instance-level accuracy and temporal consistency. The performance of the proposed method is demonstrated in diverse scenarios, and both the ego-motion estimation and the object localization are compared with state-of-the-art solutions.
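The idea of fusing sparse feature correspondences with a rough semantic 3D measurement can be sketched as a sum of two residual types in a least-squares objective. The following is an illustrative sketch under our own assumptions (function names, the box-size prior, and the weighting are ours, not the paper's):

```python
# Illustrative sketch: two residual types for object-level bundle adjustment.
# One term reprojects object-anchored 3D points; the other pulls the estimated
# box dimensions toward a class-specific prior. All names are hypothetical.
import numpy as np

def reprojection_residual(K, R_co, t_co, p_obj, uv_obs):
    """Pixel error of an object-anchored point p_obj observed at uv_obs.
    R_co, t_co map the object frame into the camera frame."""
    p_cam = R_co @ p_obj + t_co          # object frame -> camera frame
    uv = (K @ p_cam)[:2] / p_cam[2]      # pinhole projection
    return uv - uv_obs

def size_prior_residual(dims, prior_dims, sigma=0.2):
    """Penalize deviation of estimated box dimensions from a class prior
    (e.g. typical car height/length/width), weighted by sigma in meters."""
    return (dims - prior_dims) / sigma

# Toy check: a point 5 m ahead of the camera projects to the principal point,
# so its residual against an observation there is zero.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
r = reprojection_residual(K, np.eye(3), np.array([0.0, 0.0, 5.0]),
                          np.array([0.0, 0.0, 0.0]), np.array([320.0, 240.0]))
s = size_prior_residual(np.array([1.7, 4.1, 1.5]), np.array([1.6, 3.9, 1.5]))
```

In a full system both residual types (over all frames and objects) would be stacked and minimized jointly with a nonlinear least-squares solver.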
“…• Semantic localization and mapping: Although geometric features such as points, lines, and planes [151,165] are primarily used for localization in current VINS, these handcrafted features may not work best for navigation, and it is important to learn the best features for VINS by leveraging recent advances in deep learning [166]. Moreover, a few recent research efforts have attempted to endow VINS with semantic understanding of environments [167,168,169,170], a direction that is only sparsely explored but holds great potential.…”
As inertial and visual sensors become ubiquitous, visual-inertial navigation systems (VINS) have prevailed in a wide range of applications, from mobile augmented reality to aerial navigation to autonomous driving, in part because of the complementary sensing capabilities and the decreasing cost and size of the sensors. In this paper, we thoroughly survey the research efforts in this field and strive to provide a concise but complete review of the related work, which is missing from the literature despite being in great demand by researchers and engineers, in the hope of accelerating VINS research.

In the multiplicative error parametrization, δq describes the small rotation that brings the true and estimated attitudes into coincidence. The advantage of this parametrization is that it permits a minimal representation of the attitude uncertainty: the 3 × 3 covariance matrix E[δθ_I δθ_I^⊤].
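The multiplicative error parametrization referenced here is conventionally written as follows (this is the standard VINS convention; the exact symbols are our reconstruction):

```latex
q = \delta q \otimes \hat{q}, \qquad
\delta q \simeq \begin{bmatrix} \tfrac{1}{2}\,\delta\boldsymbol{\theta}_I \\ 1 \end{bmatrix}, \qquad
\mathbf{P}_{\theta} = \mathbb{E}\!\left[\delta\boldsymbol{\theta}_I\,\delta\boldsymbol{\theta}_I^{\top}\right] \in \mathbb{R}^{3\times 3}
```

Because the small-angle error δθ_I lives in R³ rather than on the 4-dimensional unit-quaternion manifold, the attitude covariance stays full-rank and minimal.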
“…Recently, new techniques have emerged to estimate the 3D spatial layout of objects as well as their occupancy [27,11,2]. These techniques rely on the quality of deep-learning object detectors [27,11] or on additional range data [2]. Similarly, volumetric approaches have been used to construct the layout of objects in rooms, or to reconstruct objects and regress their positions [33].…”
Recent approaches to visual scene understanding attempt to build a scene graph, a computational representation of objects and their pairwise relationships. Such a rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. In contrast, an image sequence conveys additional information through the multi-view geometric relations arising from camera motion. Indeed, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometric location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometric reasoning. This compelling representation is obtained with a new model in which geometric and visual features are merged using an RNN framework. We report results on a dataset we created for the task of 3D scene graph generation in multiple views.
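The merging of geometric and visual features in a recurrent model can be sketched very simply: concatenate a per-object appearance embedding with its estimated 3D location and run a recurrent update across frames. This is a minimal toy sketch under our own assumptions (the dimensions, vanilla-RNN cell, and random weights are ours; the paper's actual architecture may differ):

```python
# Toy sketch: fuse a visual embedding with a 3D location per frame and
# accumulate evidence across a video with a vanilla RNN update.
# Dimensions and weights are illustrative stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_GEO, D_HID = 8, 3, 16          # visual, geometric, hidden sizes

W_in = rng.standard_normal((D_HID, D_VIS + D_GEO)) * 0.1
W_h = rng.standard_normal((D_HID, D_HID)) * 0.1

def fuse_step(h, visual_feat, location_3d):
    """One recurrent update on the concatenated geometric + visual feature."""
    x = np.concatenate([visual_feat, location_3d])
    return np.tanh(W_in @ x + W_h @ h)

h = np.zeros(D_HID)
for _ in range(5):                       # five frames of a toy sequence
    h = fuse_step(h, rng.standard_normal(D_VIS), rng.standard_normal(D_GEO))
```

The final hidden state h would then feed a relationship classifier for each object pair; in practice the recurrent cell would be a learned GRU or LSTM rather than this random-weight stand-in.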
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.