We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks that are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows us to focus memory allocation and computation on the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks, including 3D object classification, orientation estimation, and point cloud labeling.

We illustrate the sparsity of typical 3D data in Fig. 1 using a 3D classification example. Given the 3D meshes of [48], we voxelize the input at a resolution of 64³ and train a simple 3D convolutional network to minimize a classification loss. We depict the maximum of the responses across all feature maps at different layers of the network. It is easy to observe that high activations occur only near the object boundaries.

Motivated by this observation, we propose OctNet, a 3D convolutional network that exploits this sparsity property. Our OctNet hierarchically partitions the 3D space into a set of unbalanced octrees [32]. Each octree splits the 3D space according to the density of the data. More specifically, we recursively split octree nodes that contain data points, i.e., 3D points or mesh triangles, in their domain, stopping at the finest resolution of the tree. Leaf nodes therefore vary in size; e.g., an empty leaf node may comprise up to 8³ = 512 voxels for a tree of depth 3. Each leaf node in the octree stores a pooled summary of all feature activations of the voxels it comprises. The convolutional network operations are defined directly on the structure of these trees. Our network thus dynamically focuses computational and memory resources depending on the 3D structure of the input, which leads to a significant reduction in computational and memory requirements and allows for deep learning at high resolutions. Importantly, we also show how essential network operations (convolution, pooling, and unpooling) can be implemented efficiently on this new data structure.

We demonstrate the utility of the proposed OctNet on three different problems involving three-dimensional data: 3D classification, 3D orientation estimation of unknown object instances, and semantic segmentation of 3D point clouds. In particular, we show that OctNet enables significantly higher input resolutions than dense inputs due to its lower memory consumption, while achieving performance identical to the equivalent dense network at lower resolutions. At the same time, we gain significant speed-ups at resolutions of 128³ and above. Using our OctNet, we investigate the impact of high-resolution inputs on accuracy for the three tasks and demonstrate that higher resolutions ...
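To make the hierarchical partitioning concrete, the following minimal Python sketch captures the core idea: cells that contain data are split recursively up to a fixed maximum depth, and every leaf stores a pooled summary of the activations it comprises. This is an illustration under our own assumptions (the function names, mean pooling, and array layout are ours); the actual OctNet implementation uses a hybrid grid of shallow octrees with efficient encodings rather than Python objects.

```python
import numpy as np

class Leaf:
    """Leaf cell: one pooled feature summary for its whole volume."""
    def __init__(self, feature):
        self.feature = feature

class Node:
    """Interior cell: split into 8 children according to data density."""
    def __init__(self, children):
        self.children = children

def build_octree(points, feats, origin, size, depth=0, max_depth=3):
    """points: (N, 3) coordinates inside this cell; feats: (N, C) activations."""
    if len(points) == 0 or depth == max_depth:
        # empty cells stay unsplit; occupied cells at the finest level
        # store a pooled summary of all activations they comprise
        pooled = feats.mean(axis=0) if len(points) else np.zeros(feats.shape[1])
        return Leaf(pooled)
    children, half = [], size / 2.0
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                o = origin + half * np.array([dx, dy, dz], dtype=float)
                inside = np.all((points >= o) & (points < o + half), axis=1)
                children.append(build_octree(points[inside], feats[inside],
                                             o, half, depth + 1, max_depth))
    return Node(children)

# e.g. a depth-3 tree over the unit cube around a sparse point set:
pts = np.random.rand(100, 3)
feat = np.random.rand(100, 16)
tree = build_octree(pts, feat, origin=np.zeros(3), size=1.0, max_depth=3)
```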
Figure 1: Stable View Synthesis synthesizes spatially and temporally coherent photorealistic views of complex real-world scenes. Top and left: new views of scenes from the Tanks and Temples dataset [18]. Bottom right: a new view of a scene from the FVS dataset [29].
In this paper, we present a learning-based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images. The most common approach to depth fusion is based on averaging truncated signed distance functions, which was originally proposed by Curless and Levoy in 1996. While this method is simple and provides great results, it is not able to reconstruct (partially) occluded surfaces and requires a large number of frames to filter out sensor noise and outliers. Motivated by the availability of large 3D model repositories and recent advances in deep learning, we present a novel 3D CNN architecture that learns to predict an implicit surface representation from the input depth maps. Our learning-based method significantly outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. By learning the structure of real-world 3D objects and scenes, our approach is further able to reconstruct occluded regions and to fill in gaps in the reconstruction. We demonstrate that our learning-based approach outperforms both vanilla TSDF fusion and TV-L1 fusion on the task of volumetric fusion. Further, we demonstrate state-of-the-art 3D shape completion results.
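For reference, the Curless and Levoy baseline that this learned approach is compared against is simple to state: each voxel keeps a running weighted average of truncated signed distances observed across frames. Below is a simplified numpy sketch of one integration step; the camera conventions, the flat (V,) grid layout, and all names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def integrate(tsdf, weight, vox_centers, depth, K, T_cw, trunc=0.05):
    """Fuse one depth map into a running TSDF average.
    tsdf, weight : (V,) running per-voxel TSDF value and observation weight
    vox_centers  : (V, 3) voxel centers in world coordinates
    depth        : (H, W) depth image; K: 3x3 intrinsics; T_cw: 4x4 world-to-camera
    """
    H, W = depth.shape
    # transform voxel centers into the camera frame and project to pixels
    pc = (T_cw[:3, :3] @ vox_centers.T + T_cw[:3, 3:4]).T
    z = pc[:, 2]
    zsafe = np.where(z > 1e-6, z, 1.0)            # avoid division by zero
    uv = (K @ pc.T).T
    u = np.round(uv[:, 0] / zsafe).astype(int)
    v = np.round(uv[:, 1] / zsafe).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
    sdf = d - z                                   # signed distance along the ray
    keep = valid & (d > 0) & (sdf > -trunc)       # drop voxels far behind surface
    f = np.clip(sdf / trunc, -1.0, 1.0)           # truncate to [-1, 1]
    # running weighted average with per-observation weight 1
    tsdf[keep] = (tsdf[keep] * weight[keep] + f[keep]) / (weight[keep] + 1.0)
    weight[keep] += 1.0
    return tsdf, weight
```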
Abstract. In this work we present a novel approach for single depth map super-resolution. Modern consumer depth sensors, especially Time-of-Flight sensors, produce dense depth measurements, but are affected by noise and have a low lateral resolution. We propose a method that combines the benefits of recent advances in machine-learning-based single image super-resolution, i.e., deep convolutional networks, with a variational method to recover accurate high-resolution depth maps. In particular, we integrate a variational method that models the piecewise affine structures apparent in depth data via an anisotropic total generalized variation regularization term on top of a deep network. We call our method ATGV-Net and train it end-to-end by unrolling the optimization procedure of the variational method. To train deep networks, a large corpus of training data with accurate ground truth is required. We demonstrate that it is feasible to train our method solely on synthetic data that we generate in large quantities for this task. Our evaluations show that we achieve state-of-the-art results on three different benchmarks, as well as on a challenging Time-of-Flight dataset, all without utilizing an additional intensity image as guidance.
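The key architectural device here is unrolling: a fixed number of iterations of the variational solver becomes the forward pass of a network, so step sizes and regularization weights can be trained end-to-end together with the convolutional network. The toy PyTorch sketch below unrolls a plain first-order primal-dual TV-L2 denoiser rather than the paper's anisotropic TGV model; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad(u):                      # forward differences, zero at the border
    gx = F.pad(u[..., :, 1:] - u[..., :, :-1], (0, 1))
    gy = F.pad(u[..., 1:, :] - u[..., :-1, :], (0, 0, 0, 1))
    return torch.cat([gx, gy], dim=1)

def div(p):                       # negative adjoint of grad
    px, py = p[:, 0:1], p[:, 1:2]
    dx = px - F.pad(px[..., :, :-1], (1, 0))
    dy = py - F.pad(py[..., :-1, :], (0, 0, 1, 0))
    return dx + dy

class UnrolledTV(nn.Module):
    def __init__(self, iters=10):
        super().__init__()
        self.sigma = nn.Parameter(torch.full((iters,), 0.25))  # dual steps
        self.tau = nn.Parameter(torch.full((iters,), 0.25))    # primal steps
        self.lam = nn.Parameter(torch.tensor(0.1))             # TV weight

    def forward(self, f):         # f: (B, 1, H, W) rough depth estimate
        u, u_bar = f, f
        p = torch.zeros_like(grad(f))
        for sigma, tau in zip(self.sigma, self.tau):
            # dual ascent, then projection onto the pointwise ball |p| <= lam
            p = p + sigma * grad(u_bar)
            norm = torch.sqrt(p[:, 0:1] ** 2 + p[:, 1:2] ** 2).clamp(min=1e-8)
            p = p / torch.clamp(norm / self.lam, min=1.0)
            # primal proximal step for the data term 0.5 * ||u - f||^2
            u_new = (u + tau * div(p) + tau * f) / (1.0 + tau)
            u_bar = 2.0 * u_new - u                            # extrapolation
            u = u_new
        return u
```

In a setup like the paper's, f would be the output of the preceding convolutional network, and a supervised loss on the returned refinement trains both stages jointly.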
While many recent hand pose estimation methods critically rely on a training set of labeled frames, the creation of such a dataset is a challenging task that has been overlooked so far. As a result, existing datasets are limited to a few sequences and individuals, with limited accuracy, and this prevents these methods from delivering their full potential. We propose a semi-automated method for efficiently and accurately labeling each frame of a hand depth video with the corresponding 3D locations of the joints: The user is asked to provide only an estimate of the 2D reprojections of the visible joints in some reference frames, which are automatically selected to minimize the labeling work by efficiently optimizing a sub-modular loss function. We then exploit spatial, temporal, and appearance constraints to retrieve the full 3D poses of the hand over the complete sequence. We show that this data can be used to train a recent state-of-the-art hand pose estimation method, leading to increased accuracy. The code and dataset can be found at https://github.com/moberweger/semi-auto-anno/.
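The paper optimizes a task-specific sub-modular loss for selecting reference frames; as a generic stand-in, the sketch below greedily maximizes a facility-location coverage objective over frame descriptors, the standard approach for monotone submodular objectives (greedy selection carries a (1 - 1/e) approximation guarantee). The descriptor-based similarity and all names are our assumptions, not the authors' exact formulation.

```python
import numpy as np

def select_reference_frames(descriptors, k):
    """Greedily pick k frames that best cover the whole sequence.
    descriptors: (N, D) per-frame appearance descriptors."""
    n = len(descriptors)
    # pairwise similarity = negative Euclidean distance, shifted to >= 0
    d2 = ((descriptors[:, None, :] - descriptors[None, :, :]) ** 2).sum(-1)
    sim = -np.sqrt(d2)
    sim -= sim.min()
    covered = np.zeros(n)             # how well each frame is covered so far
    chosen = []
    for _ in range(k):
        # marginal gain of each candidate: improvement in total coverage
        gains = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gains[chosen] = -np.inf       # never pick the same frame twice
        j = int(np.argmax(gains))
        chosen.append(j)
        covered = np.maximum(covered, sim[j])
    return chosen
```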
In this paper we present a novel method to increase the spatial resolution of depth images. We combine a deep fully convolutional network with a non-local variational method in a deep primal-dual network. The joint network computes a noise-free, high-resolution estimate from a noisy, low-resolution input depth map. Additionally, a high-resolution intensity image is used to guide the reconstruction in the network. By unrolling the optimization steps of a first-order primal-dual algorithm and formulating it as a network, we can train our joint method end-to-end. This not only enables us to learn the weights of the fully convolutional network, but also to optimize all parameters of the variational method and its optimization procedure. The training of such a deep network requires a large dataset for supervision. Therefore, we generate high-quality depth maps and corresponding color images with a physically based renderer. In an exhaustive evaluation we show that our method outperforms the state-of-the-art on multiple benchmarks.
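The non-local variational coupling in this paper is more involved, but the basic mechanism of intensity guidance can be sketched with simple edge-aware weights: smoothing of the upsampled depth is attenuated wherever the high-resolution guidance image has an edge. Everything below, names and parameters included, is an illustrative assumption, not the proposed primal-dual network.

```python
import numpy as np

def guidance_weights(intensity, beta=10.0):
    """Edge-aware weights from an (H, W) intensity image in [0, 1]:
    weights are small across intensity edges, near 1 in flat regions."""
    wx = np.exp(-beta * np.abs(np.diff(intensity, axis=1)))  # (H, W-1)
    wy = np.exp(-beta * np.abs(np.diff(intensity, axis=0)))  # (H-1, W)
    return wx, wy

def guided_diffusion(depth, wx, wy, iters=100, tau=0.2):
    """Explicit weighted-diffusion steps u_t = div(w * grad(u)):
    smooths the depth map while respecting guidance-image edges."""
    u = depth.astype(float).copy()
    for _ in range(iters):
        fx = wx * np.diff(u, axis=1)          # weighted flux across columns
        fy = wy * np.diff(u, axis=0)          # weighted flux across rows
        du = np.zeros_like(u)
        du[:, :-1] += fx                      # divergence of the flux
        du[:, 1:] -= fx
        du[:-1, :] += fy
        du[1:, :] -= fy
        u += tau * du
    return u
```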