2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros47612.2022.9981795
Visual-Inertial Multi-Instance Dynamic SLAM with Object-level Relocalisation

Cited by 12 publications (13 citation statements)
References 32 publications
“…Similarly, Dynamic-VINS [19] refines 2D bounding boxes generated from YOLOv3 [20] and removes feature points of dynamic objects on a resource-limited platform. Ren et al. [5] propose a dense RGB-D-inertial SLAM system that can track and relocalise multiple dynamic objects with the aid of instance segmentation from Mask R-CNN [12]. In contrast, DynaVINS [8] can remove undefined dynamic objects that are dominant in the visual input using camera motion priors from a low-cost IMU.…”
Section: B. Proprioception-aided SLAM
confidence: 99%
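The statement above describes a common pattern in dynamic SLAM: feature points falling on segmented dynamic-object regions are discarded before camera tracking. A minimal sketch of that masking step, assuming a per-pixel instance mask (e.g. from Mask R-CNN) and a hypothetical set of instance ids classified as dynamic:

```python
import numpy as np

def filter_dynamic_features(keypoints, instance_mask, dynamic_ids):
    """Drop feature points lying on dynamic object instances.

    keypoints     : (N, 2) array of (u, v) pixel coordinates
    instance_mask : (H, W) array of per-pixel instance ids (0 = background)
    dynamic_ids   : set of instance ids considered dynamic (assumption:
                    classification into static/dynamic happens upstream)
    """
    u = keypoints[:, 0].astype(int)
    v = keypoints[:, 1].astype(int)
    ids = instance_mask[v, u]                  # instance id under each keypoint
    keep = ~np.isin(ids, list(dynamic_ids))    # retain only static-scene features
    return keypoints[keep]

# toy example: instance 1 occupies the top-left 2x2 block and is dynamic
mask = np.zeros((4, 4), dtype=int)
mask[0:2, 0:2] = 1
kps = np.array([[0, 0], [3, 3], [1, 1]])       # (u, v) keypoints
static = filter_dynamic_features(kps, mask, {1})
print(static)
```

The surviving features feed the robust camera-tracking stage; real systems additionally dilate the masks to account for segmentation boundary errors.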
“…The dynamic objects can, therefore, be removed as outliers during robust camera tracking. On the other hand, when the categories of dynamic objects are predefined, the regions containing these objects can be directly detected using deep learning methods [5]. In the scenario of long-term large occlusion, the majority of the camera view is occluded for most of the time frames.…”
Section: Introduction
confidence: 99%
“…SLAM systems use semantics for better pose estimation or re-localization [2], [3] or to work in dynamic scenes [2], [3]. Semantics can also facilitate downstream tasks such as robotic navigation [4] or augmented reality (AR) experiences [5].…”
Section: Introduction
confidence: 99%
“…Real-time semantic mapping methods usually rely on 2D convolutional neural networks with optional 3D post-processing (2D-3D networks) to annotate incoming images with semantics, using back-projection to lift the semantic labels to the 3D map [6], [3], [7], [5], [8], [1], while recent FP-Conv [7] or SVCNN [6] also rely on lightweight 3D post-processing. 2D-3D networks repetitively process images with similar visual content, solving 2D semantic segmentation from scratch for each image, which may be redundant [9], may lack multi-view consistency in 2D labels [10], and may suffer from occlusions or object-scale uncertainty [11].…”
Section: Introduction
confidence: 99%
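The back-projection step mentioned above lifts each labelled pixel into 3D via the depth map and the pinhole camera model. A hedged sketch, assuming metric depth, a standard 3x3 intrinsic matrix K, and illustrative variable names:

```python
import numpy as np

def backproject_labels(depth, labels, K):
    """Lift per-pixel 2D semantic labels into labelled 3D points.

    depth  : (H, W) depth map in metres (0 marks invalid pixels)
    labels : (H, W) semantic class ids from a 2D segmentation network
    K      : (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    Returns (M, 3) camera-frame points and their (M,) labels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]   # X = (u - cx) * Z / fx
    y = (v[valid] - K[1, 2]) * z / K[1, 1]   # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=1)
    return points, labels[valid]

# toy example: a flat wall 2 m away, uniformly labelled as class 1
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)
labels = np.ones((480, 640), dtype=int)
pts, lbl = backproject_labels(depth, labels, K)
```

In a full mapping pipeline these points would then be transformed by the estimated camera pose and fused into the global map, where label fusion across views addresses the multi-view inconsistency the quoted passage mentions.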