Volumetric methods provide efficient, flexible and simple ways of integrating multiple depth images into a full 3D model. They provide dense and photorealistic 3D reconstructions, and parallelised implementations on GPUs achieve real-time performance on modern graphics hardware. Running such methods on mobile devices, giving users freedom of movement and instantaneous reconstruction feedback, however, remains challenging. In this paper we present a range of modifications to existing volumetric integration methods based on voxel block hashing, considerably improving their performance and making them applicable to tablet computer applications. We present (i) optimisations for the basic data structure and its allocation and integration; (ii) a highly optimised raycasting pipeline; and (iii) extensions to the camera tracker to incorporate IMU data. In total, our system thus achieves frame rates of up to 47 Hz on a Nvidia Shield Tablet and 910 Hz on a Nvidia GTX Titan X GPU, or even beyond 1.1 kHz without visualisation.
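The voxel block hashing idea the abstract above builds on can be sketched briefly: world space is divided into small fixed-size voxel blocks that are allocated on demand via a spatial hash, so only space near observed surfaces consumes memory. The sketch below is a minimal illustration in Python, using the common three-prime spatial hash; all names, the block size, and the voxel size are illustrative assumptions, not the paper's actual implementation.

```python
BLOCK_SIZE = 8          # voxels per block side (8^3 = 512 voxels per block)
NUM_BUCKETS = 1 << 20   # hash table size

def block_coords(x, y, z, voxel_size=0.005):
    """Map a world-space point (metres) to integer block coordinates."""
    s = BLOCK_SIZE * voxel_size
    return (int(x // s), int(y // s), int(z // s))

def block_hash(bx, by, bz):
    """Spatial hash over block coordinates; large primes spread collisions."""
    return ((bx * 73856093) ^ (by * 19349669) ^ (bz * 83492791)) % NUM_BUCKETS

class VoxelBlockHash:
    """Hash table mapping block coordinates to dense 8^3 TSDF voxel arrays."""

    def __init__(self):
        self.buckets = {}   # bucket index -> list of (coords, block) entries

    def allocate(self, coords):
        """Return the block at `coords`, allocating it on first access."""
        bucket = self.buckets.setdefault(block_hash(*coords), [])
        for c, blk in bucket:
            if c == coords:                   # already allocated
                return blk
        blk = [1.0] * BLOCK_SIZE ** 3         # truncated SDF initialised to +1
        bucket.append((coords, blk))
        return blk
```

Colliding blocks share a bucket and are disambiguated by their stored coordinates, which is the usual resolution strategy for this kind of table.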
Our abilities in scene understanding, which allow us to perceive the 3D structure of our surroundings and intuitively recognise the objects we see, are things that we largely take for granted, but for robots, the task of understanding large scenes quickly remains extremely challenging. Recently, scene understanding approaches based on 3D reconstruction and semantic segmentation have become popular, but existing methods either do not scale, fail outdoors, provide only sparse reconstructions or are rather slow. In this paper, we build on a recent hash-based technique for large-scale fusion and an efficient mean-field inference algorithm for densely-connected CRFs to present what to our knowledge is the first system that can perform dense, large-scale, outdoor semantic reconstruction of a scene in (near) real time. We also present a 'semantic fusion' approach that allows us to handle dynamic objects more effectively than previous approaches. We demonstrate the effectiveness of our approach on the KITTI dataset, and provide qualitative and quantitative results showing high-quality dense reconstruction and labelling of a number of scenes.
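The mean-field inference for densely-connected CRFs mentioned above iteratively refines per-point label marginals by passing messages between all pairs of points. A toy sketch of one such update is below; the efficient Gaussian filtering used in practice is replaced by an explicit dense pairwise sum, which is only feasible for tiny problems and serves purely as illustration, and all names are assumptions.

```python
import numpy as np

def meanfield_step(Q, unary, kernel, compat):
    """One mean-field update for a fully-connected CRF (toy, dense version).

    Q      : (N, L) current label marginals, rows sum to 1
    unary  : (N, L) negative log unary potentials
    kernel : (N, N) pairwise affinities between points
    compat : (L, L) label compatibility matrix (e.g. Potts)
    """
    msg = kernel @ Q              # message passing (dense, O(N^2) here)
    pairwise = msg @ compat       # compatibility transform
    logits = -unary - pairwise    # combine unary and pairwise energies
    Q_new = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Q_new / Q_new.sum(axis=1, keepdims=True)
```

With a Potts compatibility (penalising label disagreement), each update pulls a point's marginals toward the labels of its strongly-connected neighbours while respecting its own unary evidence.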
In the highly active research field of dense 3D reconstruction and modelling, loop closure is still a largely unsolved problem. While a number of previous works show how to accumulate keyframes, globally optimize their pose on closure, and compute a dense 3D model as a post-processing step, in this paper we propose an online framework which delivers a consistent 3D model to the user in real time. This is achieved by splitting the scene into submaps, and adjusting the poses of the submaps as and when required. We present a novel technique for accumulating relative pose constraints between the submaps at very little computational cost, and demonstrate how to maintain a lightweight, scalable global optimization of submap poses. In contrast to previous works, the number of submaps grows with the observed 3D scene surface, rather than with time. In addition to loop closure, the paper incorporates relocalization and provides a novel way of assessing tracking quality.
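The submap idea above can be illustrated with a deliberately tiny sketch: each submap carries its own pose, and accumulated relative constraints between submaps are relaxed iteratively. Real systems of this kind optimise over SE(3) poses; here poses are 1D offsets purely to keep the example short, the solver is plain Gauss-Seidel relaxation, and all names are illustrative.

```python
import numpy as np

def relax_poses(num_submaps, constraints, iters=100):
    """Relax 1D submap poses against relative constraints.

    constraints: list of (i, j, rel) meaning pose[j] - pose[i] should be rel.
    """
    poses = np.zeros(num_submaps)
    for _ in range(iters):
        for i, j, rel in constraints:
            err = (poses[j] - poses[i]) - rel
            poses[i] += 0.25 * err      # split the residual between the
            poses[j] -= 0.25 * err      # two submaps the constraint links
    return poses - poses[0]             # gauge freedom: fix the first submap
```

When a loop closure adds a new constraint between two old submaps, only the lightweight pose vector is re-relaxed; the dense geometry inside each submap stays untouched, which is what keeps the global optimisation scalable.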
Many modern 3D reconstruction methods accumulate information volumetrically using truncated signed distance functions. While this usually imposes a regular grid with fixed voxel size, not all parts of a scene necessarily need to be represented at the same level of detail. For example, a flat table needs less detail than a highly structured keyboard on it. We introduce a novel representation for the volumetric 3D data that uses hash functions rather than trees for accessing individual blocks of the scene, but which still provides different resolution levels. We show that our data structure provides efficient access and manipulation functions that can be very well parallelised, and also describe an automatic way of choosing appropriate resolutions for different parts of the scene. We embed the novel representation in a system for simultaneous localisation and mapping from RGB-D imagery and also investigate the implications of the irregular grid on interpolation routines. Finally we evaluate our system in experiments, demonstrating state-of-the-art representation accuracy at typical framerates around 100 Hz, along with 40% memory savings.
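One way to picture a hash-based multi-resolution block store, as described above, is to extend the hash key with a resolution level and search lookups from fine to coarse, so each region is served by the finest block actually allocated there. The sketch below is a hedged illustration under that assumption; the class, its methods, and the level scheme are not the paper's actual data structure.

```python
NUM_LEVELS = 4   # level 0 is finest; each level halves the resolution

class MultiResHash:
    """Toy multi-resolution block table keyed on (level, block coords)."""

    def __init__(self):
        self.table = {}   # (level, bx, by, bz) -> voxel block payload

    def allocate(self, level, bx, by, bz, payload):
        """Store a block at the given level and level-relative coordinates."""
        self.table[(level, bx, by, bz)] = payload

    def lookup(self, bx, by, bz):
        """Find the finest block covering finest-level coords (bx, by, bz).

        Level l halves the resolution l times, so finest-level coordinates
        shift right by l bits to address a block at level l.
        """
        for level in range(NUM_LEVELS):
            key = (level, bx >> level, by >> level, bz >> level)
            if key in self.table:
                return level, self.table[key]
        return None
```

Because a single hash table holds all levels, access stays O(1) per probe with no tree traversal, at the cost of a handful of probes (one per level) in the worst case.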
Without knowledge of the absolute baseline between images, the scale of a map from a single-camera simultaneous localization and mapping system is subject to calamitous drift over time. We describe a monocular approach that in addition to point measurements also considers object detections to resolve this scale ambiguity and drift. By placing a prior on the size of the objects, the scale estimation can be seamlessly integrated into a bundle adjustment. When object observations are available, the local scale of the map is then determined jointly with the camera pose in local adjustments. Unlike many previous visual odometry methods, our approach does not impose restrictions such as approximately constant camera height or planar roadways, and is therefore applicable to a much wider range of applications. We evaluate our approach on the KITTI dataset and show that it reduces scale drift over long-range outdoor sequences with a total length of 40 km. Qualitative evaluation is also performed on video footage from a hand-held camera.
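The core of the size-prior idea above follows from the pinhole model: an object of known height H at depth Z projects to h = f * H / Z pixels, so a detected pixel height constrains the metric depth and hence the map scale. The sketch below shows the kind of residual such a prior contributes to bundle adjustment; the function names and the single-observation closed form are illustrative assumptions, not the paper's formulation.

```python
def object_scale_residual(detected_height_px, focal_px, prior_height_m,
                          depth_in_map_units, scale):
    """Residual between the observed pixel height and the height predicted
    from the size prior, once map units are converted to metres via `scale`."""
    depth_m = depth_in_map_units * scale
    predicted_px = focal_px * prior_height_m / depth_m
    return detected_height_px - predicted_px

def estimate_scale(detected_height_px, focal_px, prior_height_m,
                   depth_in_map_units):
    """Closed-form scale for a single observation (zero residual)."""
    return focal_px * prior_height_m / (detected_height_px * depth_in_map_units)
```

In a full system the residual would be weighted by the variance of the size prior and minimised jointly with the reprojection terms, so many noisy detections together pin down the scale.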
We present a novel framework for jointly tracking a camera in 3D and reconstructing the 3D model of an observed object. Due to the region based approach, our formulation can handle untextured objects, partial occlusions, motion blur, dynamic backgrounds and imperfect lighting. Our formulation also allows for a very efficient implementation which achieves real-time performance on a mobile phone, by running the pose estimation and the shape optimisation in parallel. We use a level set based pose estimation but completely avoid the typically required explicit computation of a global distance transform. This leads to tracking rates of more than 100 Hz on a desktop PC and 30 Hz on a mobile phone. Further, we incorporate additional orientation information from the phone's inertial sensor, which helps us resolve the tracking ambiguities inherent to region based formulations. The reconstruction step first probabilistically integrates 2D image statistics from selected keyframes into a 3D volume, and then imposes coherency and compactness using a total variation regularisation term. The global optimum of the overall energy function is found using a continuous max-flow algorithm and we show that, similar to tracking, the integration of per voxel posteriors instead of likelihoods improves the precision and accuracy of the reconstruction.
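The probabilistic integration step described above fuses per-view foreground evidence into each voxel. A standard way to do this, sketched below under an independence assumption across views, is to accumulate the per-view posteriors in log-odds form; the function is an illustration of that Bayesian update, not the paper's exact scheme.

```python
import math

def fuse_posteriors(posteriors, eps=1e-6):
    """Fuse a list of per-view P(foreground) values for one voxel.

    Each posterior is converted to log-odds, the log-odds are summed
    (the Bayesian update under view independence), and the result is
    mapped back to a probability. `eps` guards against p in {0, 1}.
    """
    log_odds = sum(math.log((p + eps) / (1.0 - p + eps)) for p in posteriors)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Consistent evidence sharpens the fused belief (several views at 0.9 push the voxel well above 0.9), while conflicting views cancel, which is one intuition for why posteriors fuse more robustly than raw likelihoods.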
We studied the applicability of point clouds derived from tri-stereo satellite imagery to semantic segmentation with generalized sparse convolutional neural networks, using an Austrian study area as an example. In particular, we examined whether the distorted geometric information, in addition to color, influences the performance of segmenting clutter, roads, buildings, trees, and vehicles. To this end, we trained a fully convolutional neural network that uses generalized sparse convolution once solely on 3D geometric information (i.e., a 3D point cloud derived by dense image matching), and twice on 3D geometric as well as color information. In the first of these experiments we did not use class weights, whereas in the second we did. We compared the results with a fully convolutional neural network trained on a 2D orthophoto, and with a decision tree trained once on hand-crafted 3D geometric features and once on hand-crafted 3D geometric as well as color features. The decision tree using hand-crafted features has been successfully applied to aerial laser scanning data in the literature. Hence, we compared our main object of study, a representation learning technique, with another representation learning technique and with a non-representation learning technique. Our study area is located in Waldviertel, a region in Lower Austria. The territory is a hilly region covered mainly by forests, agriculture, and grasslands. Our classes of interest are heavily unbalanced; however, we did not use any data augmentation techniques to counter overfitting. For our study area, we found that adding color to geometric information only improves the performance of the Generalized Sparse Convolutional Neural Network (GSCNN) on the dominant class, which leads to a higher overall performance in our case. We also found that training the network with median class weighting partially reverts the effects of adding color.
The network also started to learn the classes with lower occurrences. The fully convolutional neural network that was trained on the 2D orthophoto generally outperforms the other two with a kappa score of over 90% and an average per class accuracy of 61%. However, the decision tree trained on colors and hand-crafted geometric features has a 2% higher accuracy for roads.
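The median class weighting mentioned above is a standard recipe for unbalanced classes: each class weight is the median class frequency divided by that class's own frequency, so rare classes receive weights above one and dominant classes below one. A minimal sketch on a toy label histogram, assuming frequencies are taken directly from training-set label counts:

```python
import numpy as np

def median_class_weights(label_counts):
    """Median-frequency class weights from per-class label counts."""
    freqs = np.asarray(label_counts, dtype=float)
    freqs = freqs / freqs.sum()           # relative class frequencies
    return np.median(freqs) / freqs       # rare classes get weights > 1
```

These weights would typically scale the per-class terms of the training loss, which is how the network is encouraged to also learn the classes with lower occurrence.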