Event cameras, or Dynamic Vision Sensor (DVS), are very promising sensors which have shown several advantages over frame based cameras. However, most recent work on real applications of these cameras is focused on 3D reconstruction and 6-DOF camera tracking. Deep learning based approaches, which are leading the state-of-the-art in visual recognition tasks, could potentially take advantage of the benefits of DVS, but some adaptations are needed still needed in order to effectively work on these cameras. This work introduces a first baseline for semantic segmentation with this kind of data. We build a semantic segmentation CNN based on state-of-the-art techniques which takes event information as the only input. Besides, we propose a novel representation for DVS data that outperforms previously used event representations for related tasks. Since there is no existing labeled dataset for this task, we propose how to automatically generate approximated semantic segmentation labels for some sequences of the DDD17 dataset, which we publish together with the model, and demonstrate they are valid to train a model for DVS data only. We compare our results on semantic segmentation from DVS data with results using corresponding grayscale images, demonstrating how they are complementary and worth combining.
This work presents a novel approach for semi-supervised semantic segmentation. The key element of this approach is our contrastive learning module that enforces the segmentation network to yield similar pixel-level feature representations for same-class samples across the whole dataset.To achieve this, we maintain a memory bank which is continuously updated with relevant and high-quality feature vectors from labeled data. In an end-to-end training, the features from both labeled and unlabeled data are optimized to be similar to same-class samples from the memory bank. Our approach not only outperforms the current state-of-the-art for semi-supervised semantic segmentation but also for semi-supervised domain adaptation on well-known public benchmarks, with larger improvements on the most challenging scenarios, i.e., less available labeled data. Code is
Abstract-Vision-based topological localization and mapping for autonomous robotic systems have received increased research interest in recent years. The need to map larger environments requires models at different levels of abstraction and additional abilities to deal with large amounts of data efficiently. Most successful approaches for appearance-based localization and mapping with large datasets typically represent locations using local image features. We study the feasibility of performing these tasks in urban environments using global descriptors instead and taking advantage of the increasingly common panoramic datasets. This paper describes how to represent a panorama using the global gist descriptor, while maintaining desirable invariance properties for location recognition and loop detection. We propose different gist similarity measures and algorithms for appearance-based localization and an online loop-closure detection method, where the probability of loop closure is determined in a Bayesian filtering framework using the proposed image representation. The extensive experimental validation in this paper shows that their performance in urban environments is comparable with local-feature-based approaches when using wide field-of-view images.
This paper addresses the robot and landmark localization problem from bearing-only data in three views, simultaneously to the robust association of this data. The localization algorithm is based on the 1D trifocal tensor, which relates linearly the observed data and the robot localization parameters. The aim of this work is to bring this useful geometric construction from computer vision closer to robotic applications. One contribution is the evaluation of two linear approaches of estimating the 1D tensor: the commonly used approach that needs seven bearing-only correspondences and another one which uses only five correspondences plus two calibration constraints. The results in this paper show that the inclusion of these constraints provides a simpler and faster solution and better estimation of robot and landmark locations in the presence of noise. Moreover, a new method that makes use of scene planes and requires only four correspondences is presented. This proposal improves the performance of the two previously mentioned methods in typical man-made scenarios with dominant planes, while it gives similar results in other cases. The three methods are evaluated with simulation tests as well as with experiments that perform automatic real data matching in conventional and omnidirectional images. The results show sufficient accuracy and stability to be used in robotic tasks such as navigation, global localization or initialization of SLAM algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.