Fig. 1: Globally consistent reconstructions produced by our approach, based on the Flat, House and Lab subsets of our dataset.

Abstract—Reconstructing dense, volumetric models of real-world 3D scenes is important for many tasks, but capturing large scenes can take significant time, and the risk of transient changes to the scene grows as the capture time increases. These are good reasons to want instead to capture several smaller sub-scenes that can be joined to make the whole scene. Achieving this has traditionally been difficult: joining sub-scenes that may never have been viewed from the same angle requires a high-quality camera relocaliser that can cope with novel poses, and tracking drift in each sub-scene can prevent them from being joined into a consistent overall scene. Recent advances, however, have significantly improved our ability to capture medium-sized sub-scenes with little to no tracking drift: real-time globally consistent reconstruction systems can close loops and re-integrate the scene surface on the fly, whilst new visual-inertial odometry approaches can significantly reduce tracking drift during live reconstruction. Moreover, high-quality regression forest-based relocalisers have recently been made more practical by the introduction of a method that allows them to be trained and used online. In this paper, we leverage these advances to present what is, to our knowledge, the first system that allows multiple users to collaborate interactively to reconstruct dense, voxel-based models of whole buildings using only consumer-grade hardware, a task that has traditionally been both time-consuming and dependent on the availability of specialised hardware. Using our system, an entire house or lab can be reconstructed in under half an hour and at a far lower cost than was previously possible.

Moreover, the risk of transient changes to the scene (e.g. people moving around) grows as the capture time increases, corrupting the model and forcing the user to restart the capture. There are thus good reasons to want to split the capture into several shorter sequences, which can be captured either over multiple sessions or in parallel (by multiple users) and then joined to make the whole scene.

Achieving this has traditionally been difficult: joining the sub-scenes requires the ability to accurately determine the relative transformations between them (a problem that can be expressed as camera relocalisation), even though the areas in which they overlap may never have been viewed from the same angles; furthermore, tracking drift in each sub-scene can prevent the sub-scenes from being joined into a consistent overall scene. Recent advances, however, have significantly improved our ability to capture consistent, medium-sized sub-scenes, e.g. by closing loops and re-integrating the scene surface on the fly [17], which yields accurate poses for individual frames once loops have been closed, or by combining visual and inertial cues in an extended Kalman filter [28] to achieve accurate camera tracking during live reconstruction. Moreo...
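To make the joining step concrete: if a relocaliser can estimate the pose of the same camera frame in the coordinate frames of two different sub-scenes, the rigid transform relating the sub-scenes follows by composition. The sketch below (a minimal illustration with hypothetical helper names, using NumPy and camera-to-world 4x4 matrices; not the paper's actual implementation) shows this composition:

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 rigid-body transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_transform(T_cam_in_i, T_cam_in_j):
    """Given the SAME camera frame's pose expressed in sub-scene i and in
    sub-scene j (both camera-to-world), return the transform that maps points
    from sub-scene j's coordinate frame into sub-scene i's: T_ij = T_i @ inv(T_j)."""
    return T_cam_in_i @ np.linalg.inv(T_cam_in_j)

# Example: the frame is localised at different poses in the two sub-scenes.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])          # 90 degree rotation about z
T_i = make_pose(Rz90, np.array([1.0, 2.0, 0.5]))
T_j = make_pose(np.eye(3), np.array([-0.3, 0.0, 1.0]))

T_ij = relative_transform(T_i, T_j)
# Mapping sub-scene j's pose through T_ij recovers the pose in sub-scene i.
assert np.allclose(T_ij @ T_j, T_i)
```

In practice each relocalised frame yields one such estimate of T_ij, and multiple estimates can be aggregated (e.g. by robust averaging or pose-graph optimisation) to suppress noisy relocalisations before the sub-scenes are fused.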