Recent work on depth estimation has focused exclusively on projective images, ignoring 360° content, which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, showcasing the need for training directly on 360° datasets, which, however, are hard to acquire. In this work, we circumvent the challenges associated with acquiring high-quality 360° datasets with ground truth depth annotations by re-using recently released large-scale 3D datasets and re-purposing them to 360° via rendering. This dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn the task of depth estimation from 360° images in an end-to-end fashion. We show promising results on our synthesized data as well as on unseen realistic images.
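The re-purposing step boils down to rendering into the equirectangular domain, where each 3D point maps to a longitude/latitude pair. Below is a minimal sketch of that projection, assuming a y-up, +z-forward camera convention and radial depth; the dataset's actual rendering conventions may differ.

```python
import numpy as np

def point_to_equirect(points, width, height):
    """Project 3D points (N, 3) in camera space onto an equirectangular image.

    Assumes a y-up camera with the optical axis along +z (an assumption,
    not necessarily the convention used to build the dataset).
    Returns (u, v) pixel coordinates and the radial depth r.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                 # radial (spherical) depth
    lon = np.arctan2(x, z)                             # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(y / np.maximum(r, 1e-8), -1.0, 1.0))  # latitude
    u = (lon / (2.0 * np.pi) + 0.5) * width            # horizontal pixel
    v = (0.5 - lat / np.pi) * height                   # vertical pixel
    return u, v, r
```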
Learning-based approaches for depth perception are limited by the availability of clean training data. This has led to the utilization of view synthesis as an indirect objective for learning depth estimation with efficient data acquisition procedures. Nonetheless, most research focuses on pinhole-based monocular vision, with scarce works presenting results for omnidirectional input. In this work, we explore spherical view synthesis for learning monocular 360° depth in a self-supervised manner and demonstrate its feasibility. Under a purely geometrically derived formulation, we present results for horizontal and vertical baselines, as well as for the trinocular case. Further, we show how to better exploit the expressiveness of traditional CNNs when applied to the equirectangular domain in an efficient manner. Finally, given the availability of ground truth depth data, our work is uniquely positioned to compare view synthesis against direct supervision in a consistent and fair manner. The results indicate that alternative research directions might be better suited to enable higher-quality depth perception. Our data, models and code are publicly available at https
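At the core of such a self-supervised objective is spherical depth-image-based rendering: given an equirectangular depth map and the baseline to a neighbouring viewpoint, every pixel can be back-projected, translated, and re-projected to form a sampling grid for synthesizing the other view. The sketch below is a purely geometric illustration under assumed conventions (radial depth, y-up, +z-forward, translation-only baseline); the paper's exact parameterization and losses are not reproduced here.

```python
import numpy as np

def spherical_forward_coords(depth, baseline, width, height):
    """Where each pixel of a source equirectangular view lands in a target view.

    `depth` is the (H, W) radial depth of the source view, `baseline` a
    3-vector giving the target camera position in the source frame (metres).
    Returns target pixel coordinates usable as a sampling grid for view
    synthesis. Conventions here are assumptions for illustration only.
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    lon = (u / width - 0.5) * 2.0 * np.pi            # longitude per pixel
    lat = (0.5 - v / height) * np.pi                 # latitude per pixel
    # Back-project: unit ray on the sphere scaled by radial depth.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    pts = dirs * depth[..., None] - np.asarray(baseline)  # shift into target frame
    r = np.linalg.norm(pts, axis=-1)
    lon_t = np.arctan2(pts[..., 0], pts[..., 2])
    lat_t = np.arcsin(np.clip(pts[..., 1] / np.maximum(r, 1e-8), -1.0, 1.0))
    u_t = (lon_t / (2.0 * np.pi) + 0.5) * width
    v_t = (0.5 - lat_t / np.pi) * height
    return u_t, v_t
```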
The latest developments in 3D capturing, processing, and rendering provide means to unlock novel 3D application pathways. The main elements of an integrated platform, which targets tele-immersion and future 3D applications, are described in this paper, addressing the tasks of real-time capturing, robust 3D human shape/appearance reconstruction, and skeleton-based motion tracking. More specifically, the details of a multiple RGB-depth (RGB-D) capturing system are initially given, along with a novel sensor calibration method. A robust, fast reconstruction method from multiple RGB-D streams is then proposed, based on an enhanced variation of the volumetric Fourier transform-based method, parallelized on the Graphics Processing Unit and accompanied by an appropriate texture-mapping algorithm. On top of that, given the lack of relevant objective evaluation methods, a novel framework is proposed for the quantitative evaluation of real-time 3D reconstruction systems. Finally, a generic, multiple-depth-stream-based method for accurate real-time human skeleton tracking is proposed. Detailed experimental results with multi-Kinect2 datasets verify the validity of our arguments and the effectiveness of the proposed system and methodologies.
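For illustration of the volumetric-fusion family only, the sketch below integrates a single calibrated depth map into a truncated signed distance (TSDF) volume on the CPU. It is not the paper's method, which uses an enhanced Fourier transform-based volumetric reconstruction parallelized on the GPU; all names and parameters here are hypothetical.

```python
import numpy as np

def integrate_depth_tsdf(tsdf, weights, depth, K, cam_to_world,
                         voxel_size, origin, trunc=0.04):
    """Generic weighted TSDF update from one depth map (illustration only).

    `K` is the 3x3 depth intrinsics and `cam_to_world` a 4x4 sensor pose
    such as one obtained from an external calibration step.
    """
    res = tsdf.shape
    # Voxel centres in world coordinates.
    ii, jj, kk = np.meshgrid(*[np.arange(r) for r in res], indexing="ij")
    vox = np.stack([ii, jj, kk], axis=-1).reshape(-1, 3) * voxel_size + origin
    # Transform into the sensor frame and project with the pinhole model.
    world_to_cam = np.linalg.inv(cam_to_world)
    pts = vox @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = pts[:, 2]
    u = np.round(pts[:, 0] / np.maximum(z, 1e-8) * K[0, 0] + K[0, 2]).astype(int)
    v = np.round(pts[:, 1] / np.maximum(z, 1e-8) * K[1, 1] + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    sdf = d - z                                      # signed distance along the ray
    keep = valid & (d > 0) & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    # Running weighted average of the truncated signed distance.
    flat_t[keep] = (flat_t[keep] * flat_w[keep] + tsdf_new[keep]) / (flat_w[keep] + 1)
    flat_w[keep] += 1
    return tsdf, weights
```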
The focus of research into 5G networks to date has largely been on the required advances in network architectures, technologies, and infrastructures. Less effort has been put into the applications and services that will make use of and exploit the flexibility of 5G networks built upon the concepts of software-defined networking (SDN) and network function virtualization (NFV). Media-based applications are amongst the most demanding services, requiring large bandwidths for high ...
Multi-view capture systems are complex systems to engineer. They require technical knowledge to install and complex processes to set up. However, with the ongoing developments in new production methods, we are now in a position to generate high-quality, realistic 3D assets. Nonetheless, the capturing systems developed with these methods are intertwined with them, relying on custom solutions that are seldom, if at all, publicly available. We design, develop and publicly offer a multi-view capture system based on the latest RGB-D sensor technology. We also develop a portable and easy-to-use external calibration process to allow for its widespread use.
Depth perception is considered an invaluable source of information for various vision tasks. However, depth maps acquired using consumer-level sensors still suffer from non-negligible noise. This fact has recently motivated researchers to exploit traditional filters, as well as the deep learning paradigm, in order to suppress the aforementioned non-uniform noise while preserving geometric details. Despite these efforts, deep depth denoising is still an open challenge, mainly due to the lack of clean data that could be used as ground truth. In this paper, we propose a fully convolutional deep autoencoder that learns to denoise depth maps, overcoming the lack of ground truth data. Specifically, the proposed autoencoder exploits multiple views of the same scene, captured from different viewpoints, in order to learn to suppress noise in a self-supervised, end-to-end manner, using depth and color information during training, yet only depth during inference. To enforce self-supervision, we leverage a differentiable rendering technique to exploit photometric supervision, which is further regularized using geometric and surface priors. As the proposed approach relies on raw data acquisition, a large RGB-D corpus is collected using Intel RealSense sensors. Complementary to a quantitative evaluation, we demonstrate the effectiveness of the proposed self-supervised denoising approach on established 3D reconstruction applications. Code is available at https://github.com/VCL3D/DeepDepthDenoising
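The self-supervision signal can be pictured as a cross-view photometric residual: depth predicted for one sensor is used to warp its pixels into a neighbouring calibrated view, where colours are compared. The sketch below is a plain, non-differentiable numpy illustration of that residual; in the actual approach the warp is realized with a differentiable renderer so gradients reach the denoiser, and the exact losses and regularizers are not reproduced here.

```python
import numpy as np

def photometric_residual(depth_a, color_a, color_b, K, a_to_b):
    """Per-pixel L1 photometric error between view A and view B induced by A's depth.

    `K` is a shared 3x3 pinhole intrinsics matrix and `a_to_b` a 4x4 rigid
    transform from A's frame to B's frame (both assumptions for illustration).
    """
    h, w = depth_a.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth_a
    # Back-project pixels of A to 3D points in A's frame.
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts_b = pts @ a_to_b[:3, :3].T + a_to_b[:3, 3]
    # Project into B and sample colours with a nearest-neighbour lookup.
    zb = np.maximum(pts_b[:, 2], 1e-8)
    ub = np.round(pts_b[:, 0] / zb * K[0, 0] + K[0, 2]).astype(int)
    vb = np.round(pts_b[:, 1] / zb * K[1, 1] + K[1, 2]).astype(int)
    valid = (pts_b[:, 2] > 0) & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h) \
            & (z.reshape(-1) > 0)
    res = np.zeros(h * w)
    res[valid] = np.abs(color_a.reshape(-1, 3)[valid]
                        - color_b[vb[valid], ub[valid]]).mean(axis=-1)
    return res.reshape(h, w)
```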
Free-viewpoint capture technologies have recently started demonstrating impressive results. Being able to capture human performances in full 3D is a very promising technology for a variety of applications. However, the setup of the capturing infrastructure is usually expensive and requires trained personnel. In this work we focus on one practical aspect of setting up a free-viewpoint capturing system: the spatial alignment of the sensors. Our work aims at simplifying the external calibration process, which typically requires significant human intervention and technical knowledge. Our method uses an easy-to-assemble structure and, unlike similar works, does not rely on markers or features. Instead, we exploit a priori knowledge of the structure's geometry to establish correspondences for the minimally overlapping viewpoints typically found in free-viewpoint capture setups. These correspondences establish an initial sparse alignment that is then densely optimized. At the same time, our pipeline improves robustness to assembly errors, allowing non-technical users to calibrate multi-sensor setups. Our results showcase the feasibility of our approach, which can make the tedious calibration process easier and less error-prone.
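The sparse alignment stage can be illustrated with the standard closed-form rigid registration of corresponding 3D points (Kabsch / Procrustes). The sketch below assumes such correspondences have already been derived from the structure's known geometry and omits the subsequent dense optimization and robustness handling.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) that maps `src` onto `dst`.

    `src` and `dst` are (N, 3) arrays of corresponding 3D points, e.g. the
    known vertices of a calibration structure and their observed positions
    in one sensor's frame (an illustrative assumption, not the paper's
    exact pipeline).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)                       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```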
While studies on the objective and subjective evaluation of the visual quality of compressed 3D meshes have been discussed in the literature, those studies covered the evaluation of 3D meshes created either by 3D artists or generated by a computationally expensive 3D reconstruction process applied to high-quality 3D scans. With the advent of RGB-D sensors that operate at high frame rates and the utilization of fast 3D reconstruction algorithms, humans can be captured and reconstructed into a 3D mesh representation in real time, enabling new (tele-)immersive experiences. The way the respective 3D mesh content is produced differs dramatically between the two cases, leading to apparent structural differences between the output meshes. On the one hand, the first type of content is nearly perfect and clean, while on the other hand, the second type is much more irregular and noisy. Thus, evaluating compression artifacts on this new type of immersive 3D media constitutes a yet unexplored scientific area. In this paper, we aim to subjectively assess the compression artifacts introduced by three open-source static 3D mesh codecs when compressing 3D meshes generated for immersive experiences. The subjective evaluation of the content is conducted in a Virtual Reality setting, using the forced-choice pairwise comparison methodology with an existing reference. The result of this study is a mapping of the compared conditions to a continuous ranking scale that can be used to optimize codec choice and compression parameters, achieving an optimal balance between bandwidth and perceived quality in tele-immersive platforms.
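One common way to map forced-choice pairwise outcomes onto a continuous scale is Thurstone Case V scaling of the preference proportions. The sketch below shows that mapping as an assumed illustration only; it is not necessarily the scaling model used in the study.

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(win_counts):
    """Map pairwise forced-choice outcomes to a continuous quality scale.

    `win_counts[i, j]` is the number of observers who preferred condition i
    over condition j (a hypothetical input layout for this illustration).
    """
    n = win_counts + win_counts.T
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(n > 0, win_counts / n, 0.5)     # preference proportions
    p = np.clip(p, 0.01, 0.99)                       # avoid infinite z-scores
    z = norm.ppf(p)                                  # probit transform
    scores = z.mean(axis=1)                          # Case V: average z per condition
    return scores - scores.min()                     # shift so the worst is zero
```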