Figure 1: Our system enables the real-time capture of general shapes undergoing non-rigid deformations using a single depth camera. Top left: the object to be captured is scanned while undergoing rigid deformations, creating a base template. Bottom left: the object is manipulated and our method deforms the template to track the object. Top and middle rows: our reconstructions of upper-body, face, and hand sequences captured in different poses as they deform. Bottom row: the corresponding color and depth data for the reconstructed mesh in the middle row.

Abstract

We present a combined hardware and software solution for markerless reconstruction of non-rigidly deforming physical objects with arbitrary shape in real time. Our system uses a single self-contained stereo camera unit built from off-the-shelf components and consumer graphics hardware to generate spatio-temporally coherent 3D models at 30 Hz. A new stereo matching algorithm estimates RGB-D data in real time. We start by scanning a smooth template model of the subject as they move rigidly. This geometric surface prior avoids strong scene assumptions, such as a kinematic human skeleton or a parametric shape model. Next, a novel GPU pipeline performs non-rigid registration of live RGB-D data to the smooth template using an extended non-linear as-rigid-as-possible (ARAP) framework. High-frequency details are fused onto the final mesh using a linear deformation model. The system is an order of magnitude faster than state-of-the-art methods, while matching the quality and robustness of many offline algorithms. We show precise real-time reconstructions of diverse scenes, including: large deformations of users' heads, hands, and upper bodies; fine-scale wrinkles and folds of skin and clothing; and non-rigid interactions performed by users on flexible objects such as toys. We demonstrate how the acquired models can be used for many interactive scenarios, including re-texturing, online performance capture and preview, and real-time shape and motion re-targeting.
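The extended as-rigid-as-possible (ARAP) registration mentioned above builds on a standard rigidity energy. The sketch below illustrates, under simplifying assumptions, its two classic ingredients: a per-vertex best-fit rotation (local step) and the rigidity residual summed over mesh edges. Array layouts and function names (`estimate_local_rotations`, `arap_energy`) are illustrative, not the paper's GPU implementation, which additionally couples this prior to a data-fitting term against the live depth.

```python
import numpy as np

def estimate_local_rotations(rest, deformed, neighbors):
    """Per-vertex best-fit rotation R_i from rest to deformed edge vectors (SVD local step)."""
    rotations = []
    for i, nbrs in enumerate(neighbors):
        P = rest[nbrs] - rest[i]           # rest-pose edge vectors (|N(i)| x 3)
        Q = deformed[nbrs] - deformed[i]   # deformed edge vectors
        U, _, Vt = np.linalg.svd(P.T @ Q)  # covariance S = sum_j p_j q_j^T
        R = (U @ Vt).T
        if np.linalg.det(R) < 0:           # guard against reflections
            U[:, -1] *= -1
            R = (U @ Vt).T
        rotations.append(R)
    return rotations

def arap_energy(rest, deformed, neighbors, rotations):
    """ARAP energy: deformed edges should match rigidly rotated rest edges."""
    e = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            d = (deformed[i] - deformed[j]) - rotations[i] @ (rest[i] - rest[j])
            e += float(d @ d)
    return e
```

A common strategy alternates between updating the per-vertex rotations and minimizing this energy together with a fit to the observed depth; the paper's pipeline solves a related non-linear problem directly on the GPU.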
Personalized Blend Shapes and More Results

Fig. 4 shows a selection of expressions for the generic Emily model and for the four corresponding models derived from it, which were used to generate the results shown in Figs. 5, 6, 7 and 8.

Key Frame Selection

The middle pane of Fig. 1 shows the three rectangular regions of fixed size around the eyes and mouth, selected on an example frame after aligning it with the reference frame f_t0 (a neutral rest pose). These regions are used to build the LBP descriptors that automatically find key frames depicting an expression similar to the reference frame. The right pane shows the smaller regions around each of the 66 tracked feature points, which are used to find in-between key frames that share a local appearance with the reference frame around the facial features.

Coupling the 2D and 3D Model

To couple the 66 sparse features that are tracked in the video to their corresponding 3D positions on the generic blend shape model, we render a frontal snapshot of the neutral pose and use the feature tracker to estimate the facial features. This works for a shaded OpenGL rendering of the model with constant material in front of a black background, but the detected features still need minor manual correction for better alignment; in particular, the eyes of the Emily blend shape model are unnaturally large, so the features detected around them must be corrected. As the 2D features are the projections of the corresponding 3D points on the blend shape model, correspondences can easily be established by back-projection onto the mesh. Since all personalized blend shape models used in our results are derived from the same generic Emily model, the indices of the resulting set of 3D feature vertices are the same for all actors. Thus, this step only needs to be completed once and only has to be repeated if a different generic face model is used.

Comparison with Binocular Reconstruction

The 3D reconstruction quality of the binocular method of [Valgaerts et al. 2012] is quite high, but our monocular method is also able to capture high-frequency detail and produces very accurate overlays (see also the comparison in the main paper and the video). In Tab. 1, we provide quantitative results for the comparison in the main paper. It lists the average Euclidean distance between the nearest visible vertices on the binocular and monocular meshes for the sequences of Fig. 5 and Fig. 6, for a mesh size of 200k.

Table 1:
Sequence                    Average distance (mm)    Average maximum distance (mm)
Fig. 5 (over 565 frames)    1.71                     7.45
Fig. 6 (over 402 frames)    2.91                     9.82

The deviation of our monocular result from the binocular result lies in the millimeter range despite the lack of direct depth information. A color-coded overlay of this distance for the first sequence is shown in Fig. 5. For this comparison, the nearest vertices between the binocular and monocular results were recomputed for each frame, thus highlighting the shape reconstruction accuracy. However, if we determine the nearest vertices in the reference frame and keep them fixed over all other ...
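The LBP-based key-frame selection described in the Key Frame Selection paragraph above compares each frame's appearance in a few fixed regions to the reference frame. The following is a minimal, illustrative sketch under simple assumptions (grayscale frames, hand-chosen rectangular regions, a chi-squared histogram distance); the helper names (`lbp_image`, `region_descriptor`, `chi2`) are placeholders, not the authors' implementation.

```python
import numpy as np

def lbp_image(gray):
    """8-neighbour local binary pattern codes for the interior pixels of a grayscale patch."""
    c = gray[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nbr = gray[1 + dy:gray.shape[0] - 1 + dy, 1 + dx:gray.shape[1] - 1 + dx]
        codes |= (nbr >= c).astype(np.int32) << bit
    return codes

def region_descriptor(gray, regions):
    """Concatenated, normalised LBP histograms over fixed rectangles (y0, y1, x0, x1)."""
    hists = []
    for (y0, y1, x0, x1) in regions:
        h, _ = np.histogram(lbp_image(gray[y0:y1, x0:x1]), bins=256, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

def chi2(a, b, eps=1e-10):
    """Chi-squared distance between two normalised histograms."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

# Frames whose descriptor distance to the reference frame falls below a threshold
# are candidate key frames showing a similar (e.g. neutral) expression.
```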
Recent progress in passive facial performance capture has shown impressively detailed results on highly articulated motion. However, most methods rely on complex multi-camera set-ups, controlled lighting, or fiducial markers. This prevents them from being used in general environments, outdoor scenes, during live action on a film set, or by freelance animators and everyday users who want to capture their digital selves. In this paper, we therefore propose a lightweight passive facial performance capture approach that is able to reconstruct high-quality dynamic facial geometry from only a single pair of stereo cameras. Our method succeeds under uncontrolled and time-varying lighting, and also in outdoor scenes. Our approach builds upon and extends recent image-based scene flow computation, lighting estimation, and shading-based refinement algorithms. It integrates them into a pipeline that is specifically tailored towards facial performance reconstruction from challenging binocular footage under uncontrolled lighting. An experimental evaluation demonstrates the strong capabilities of our method: we achieve detailed and spatio-temporally coherent results for expressive facial motion in both indoor and outdoor scenes, even from low-quality input images recorded with a hand-held consumer stereo camera. We believe that our approach is the first to capture facial performances of such high quality from a single stereo rig, and we demonstrate that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.
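The image-based scene flow component mentioned above couples stereo disparity with temporal optical flow through brightness constancy. Purely as a conceptual sketch (not the variational method the paper builds on), the per-pixel photometric residuals for a rectified stereo pair could look like the following; all function and variable names are assumptions for illustration.

```python
import numpy as np

def sample(img, x, y):
    """Bilinear lookup of a grayscale image at sub-pixel coordinates (border-clamped)."""
    h, w = img.shape
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x1]
            + (1 - ax) * ay * img[y1, x0] + ax * ay * img[y1, x1])

def sceneflow_residuals(L0, R0, L1, R1, x, y, d0, u, v, d1):
    """Brightness-constancy residuals at left-image pixel (x, y) of frame t,
    given disparities d0 (time t) and d1 (time t+1) and optical flow (u, v)."""
    i = L0[int(y), int(x)]
    return np.array([
        i - sample(R0, x - d0, y),          # stereo constancy at time t
        i - sample(L1, x + u, y + v),       # flow constancy, left camera
        i - sample(R1, x + u - d1, y + v),  # stereo + flow constancy, right camera
    ])
```

In a full estimator these residuals would be minimized over dense disparity and flow fields together with smoothness regularization; the resulting correspondences define the 3D motion of the face.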
Figure 1: Our method obtains fine-scale detail through volumetric shading-based refinement (VSBR) of a distance field. We scan an object using a commodity sensor (here, a PrimeSense) to generate an implicit representation. Unfortunately, this leads to over-smoothing. Exploiting the shading cues from the RGB data allows us to obtain reconstructions at previously unseen resolutions within only a few seconds.

Abstract

We present a novel method to obtain fine-scale detail in 3D reconstructions generated with low-budget RGB-D cameras or other commodity scanning devices. As the depth data of these sensors is noisy, truncated signed distance fields are typically used to regularize out the noise, which unfortunately leads to over-smoothed results. In our approach, we leverage RGB data to refine these reconstructions through shading cues, as color input is typically of much higher resolution than the depth data. As a result, we obtain reconstructions with high geometric detail, far beyond the depth resolution of the camera itself. Our core contribution is shading-based refinement directly on the implicit surface representation, which is generated from globally-aligned RGB-D images. We formulate the inverse shading problem on the volumetric distance field, and present a novel objective function which jointly optimizes for fine-scale surface geometry and spatially-varying surface reflectance. In order to enable the efficient reconstruction of sub-millimeter detail, we store and process our surface using a sparse voxel hashing scheme which we augment by introducing a grid hierarchy. A tailored GPU-based Gauss-Newton solver enables us to refine large shape models to previously unseen resolution within only a few seconds.
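The objective described above couples the signed distance field to the observed images through a shading model: the surface normal is the normalized SDF gradient, and shading is predicted from per-voxel albedo and low-order spherical-harmonics (SH) lighting. The sketch below illustrates that coupling on a dense grid under simplifying assumptions; the voxel indexing, the constant factors of the SH basis, and the function names are illustrative, whereas the actual system evaluates this on a sparse voxel-hashed grid with a GPU Gauss-Newton solver.

```python
import numpy as np

def sdf_normal(sdf, i, j, k):
    """Surface normal at an interior voxel from central differences of the distance field."""
    g = np.array([sdf[i + 1, j, k] - sdf[i - 1, j, k],
                  sdf[i, j + 1, k] - sdf[i, j - 1, k],
                  sdf[i, j, k + 1] - sdf[i, j, k - 1]])
    return g / (np.linalg.norm(g) + 1e-12)

def sh_basis(n):
    """Nine real SH basis polynomials at a unit normal (normalisation constants folded into the coefficients)."""
    x, y, z = n
    return np.array([1.0, y, z, x, x * y, y * z, 3 * z * z - 1.0, x * z, x * x - y * y])

def shading_residual(sdf, albedo, sh_coeffs, observed_intensity, i, j, k):
    """Difference between the observed intensity associated with voxel (i, j, k) and the
    Lambertian shading predicted from albedo, SH lighting, and the SDF normal."""
    predicted = albedo[i, j, k] * (sh_coeffs @ sh_basis(sdf_normal(sdf, i, j, k)))
    return observed_intensity - predicted
```

Stacking such residuals over all near-surface voxels, plus regularizers on the distance values and the albedo, yields the non-linear least-squares problem that a Gauss-Newton solver can minimize.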
Figure 1: Our method takes as input depth and aligned RGB images from any consumer depth camera (here a PrimeSense Carmine 1.09). Per frame and in real time we approximate the incident lighting and albedo, and use these for geometry refinement. From left: example input depth and RGB image; raw depth input prior to refinement (rendered with normals and Phong shading, respectively); our refined result, note the detail on the eye (top right) compared to the original depth map (bottom right); full 3D reconstruction using our refined depth maps in the real-time scan integration method of [Nießner et al. 2013] (far right).

Abstract

We present the first real-time method for refinement of depth data using shape-from-shading in general uncontrolled scenes. Per frame, our real-time algorithm takes raw noisy depth data and an aligned RGB image as input, and approximates the time-varying incident lighting, which is then used for geometry refinement. This leads to dramatically enhanced depth maps at 30 Hz. Our algorithm makes few scene assumptions, handling arbitrary scene objects even under motion. To enable this type of real-time depth map enhancement, we contribute a new highly parallel algorithm that reformulates the inverse rendering optimization problem in prior work, allowing us to estimate lighting and shape in a temporally coherent way at video frame rates. Our optimization problem is minimized using a new regular-grid Gauss-Newton solver implemented fully on the GPU. We demonstrate results showing enhanced depth maps, which are comparable to offline methods but are computed orders of magnitude faster, as well as baseline comparisons with online filtering-based methods. We conclude with applications of our higher-quality depth maps for improved real-time surface reconstruction and performance capture.
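The per-frame lighting approximation mentioned above is commonly posed as a linear least-squares fit of a low-order spherical-harmonics (SH) lighting model to the observed intensities, using normals computed from the raw depth map. The sketch below shows that idea under the simplifying assumption of a Lambertian scene with constant albedo; the function names and data layout are illustrative and do not reflect the paper's GPU solver.

```python
import numpy as np

def sh_basis(normals):
    """Nine real SH basis polynomials per unit normal, shape (N, 9)
    (normalisation constants folded into the lighting coefficients)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([np.ones_like(x), y, z, x,
                     x * y, y * z, 3 * z * z - 1.0, x * z, x * x - y * y], axis=1)

def estimate_lighting(normals, intensities):
    """Least-squares SH lighting coefficients from per-pixel depth-map normals
    and the corresponding grayscale intensities of the aligned RGB image."""
    A = sh_basis(normals)                        # (N, 9) design matrix
    coeffs, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    return coeffs

# With the lighting fixed, each pixel's depth can then be refined so that the shading
# predicted from its depth-dependent normal matches the RGB image, which is the
# shape-from-shading data term of the refinement.
```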