“…Beyond monocular SLAM, effective segmentation and tracking of dynamic objects can be achieved [1,4,27,32,39,68,78] with auxiliary depth data from stereo, RGB-D, and LiDAR, which, however, is not generally available for videos captured in the wild. Thanks to the rapid development of deep learning for visual recognition, many works [3,5,85,89,93] tackle this problem by incorporating object detection, semantic segmentation, and instance segmentation. However, these methods are often restricted to pre-defined semantic classes.…”