“…First, simplified 3D motion and scene models were considered. For example, Censi et al. [6] used a known map of markers, Gallego et al. [7] considered known sets of poses and depth maps, and Kim et al. and Reinbacher et al. [12], [20] considered only rotation. Other approaches fused events with IMU data [26].…”
Event-based vision sensors, such as the Dynamic Vision Sensor (DVS), are ideally suited for real-time motion analysis. Such sensors offer high temporal resolution, superior light sensitivity and low latency, properties that make it possible to estimate motion efficiently and reliably even in highly challenging scenarios. These advantages come at a price: modern event-based vision sensors have extremely low resolution, produce substantial noise and require novel algorithms to handle the asynchronous event stream.

This paper presents a new, efficient approach to object tracking with asynchronous cameras. We present a novel event stream representation which enables us to utilize information about the dynamic (temporal) component of the event stream. The 3D geometry of the event stream is approximated with a parametric model to motion-compensate for the camera (without feature tracking or explicit optical flow computation), and moving objects that do not conform to the model are then detected in an iterative process. We demonstrate our framework on the task of independent motion detection and tracking, where we use the temporal model inconsistencies to locate differently moving objects in challenging situations of very fast motion.
SUPPLEMENTARY MATERIAL
The supplementary video materials and datasets will be made available at
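To make the motion-compensation idea in the abstract above concrete, here is a minimal Python sketch. It assumes events are given as plain (x, y, t) arrays and uses a simple 2-parameter translational warp scored by image variance; the warp model, the optimizer and the inconsistency test are illustrative stand-ins, not the authors' actual parametric model or detection rule.

```python
# Minimal sketch of global motion compensation on an event stream.
# Events are assumed given as arrays of x, y, t; the translational
# warp and the variance objective are illustrative simplifications.
import numpy as np
from scipy.optimize import minimize

def warp_events(xs, ys, ts, theta):
    """Shift each event back along a global flow (vx, vy) to t = 0."""
    vx, vy = theta
    return xs - vx * ts, ys - vy * ts

def event_image(xw, yw, shape):
    """Accumulate warped events into a count image (IWE)."""
    img = np.zeros(shape)
    xi, yi = np.round(xw).astype(int), np.round(yw).astype(int)
    ok = (xi >= 0) & (xi < shape[1]) & (yi >= 0) & (yi < shape[0])
    np.add.at(img, (yi[ok], xi[ok]), 1.0)
    return img

def objective(theta, xs, ys, ts, shape):
    """Negative variance of the warped-event image; minimizing this
    maximizes contrast, i.e. sharpens the compensated image."""
    img = event_image(*warp_events(xs, ys, ts, theta), shape)
    return -np.var(img)

def compensate(xs, ys, ts, shape=(180, 240)):
    """Fit the global model, then flag events it explains poorly."""
    res = minimize(objective, x0=np.zeros(2),
                   args=(xs, ys, ts, shape), method='Nelder-Mead')
    xw, yw = warp_events(xs, ys, ts, res.x)
    img = event_image(xw, yw, shape)
    xi = np.clip(np.round(xw).astype(int), 0, shape[1] - 1)
    yi = np.clip(np.round(yw).astype(int), 0, shape[0] - 1)
    # Events landing in low-density regions of the compensated image
    # are inconsistent with the global motion (placeholder heuristic).
    inconsistent = img[yi, xi] < np.median(img[img > 0])
    return res.x, inconsistent
```

In this toy setup, events whose warped positions fall in low-density regions of the compensated image are flagged as candidates for independently moving objects; the paper's iterative detection scheme would refine such a seed.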
“…The problem of 3D motion estimation was studied following the visual odometry and SLAM formulation for the case of rotation only [28], with known maps [38], [7], [15], by combining event-based data with image measurements [21], [32], and using IMU sensors [44]. Other recent approaches jointly reconstruct the image intensity of the scene, and estimate 3D motion.…”
Section: A. Event-Based Optical Flow, Depth and Motion Estimation
We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset, EV-IMO, which includes accurate pixel-wise motion masks, egomotion and ground-truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low-parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates pixel-wise segmentation of independently moving objects and computes per-object 3D translational velocities. We also train a shallow network with just 40k parameters, which is able to compute depth and egomotion.

Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast moving objects in the camera field of view. The objects and the camera are tracked using a VICON motion capture system. By 3D scanning the room and the objects, ground truth for the depth map and pixel-wise object masks is obtained. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that it is well suited for scene-constrained robotics applications.
SUPPLEMENTARY MATERIAL
The supplementary video, code, trained models and appendix will be made available at
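As a rough illustration of what a low-parameter depth-and-egomotion network might look like, here is a toy PyTorch sketch. The event representation (a stack of time slices), the layer widths and both heads are invented for illustration; they do not reproduce the paper's 40k-parameter architecture or its training losses.

```python
# Toy sketch of a low-parameter depth + egomotion network, in the
# spirit of the pipeline above. Layer sizes are invented.
import torch
import torch.nn as nn

class TinySfMNet(nn.Module):
    def __init__(self, in_ch=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # Dense depth head: upsample features back to input resolution.
        self.depth = nn.Sequential(
            nn.Upsample(scale_factor=8, mode='bilinear',
                        align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1))
        # Egomotion head: 6-DoF pose (3 translation + 3 rotation).
        self.pose = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))

    def forward(self, x):
        f = self.encoder(x)
        return self.depth(f), self.pose(f)

net = TinySfMNet()
events = torch.randn(1, 5, 256, 256)  # batch of stacked event slices
depth, egomotion = net(events)
print(depth.shape, egomotion.shape)   # (1, 1, 256, 256), (1, 6)
```

Even this toy encoder-plus-two-heads layout stays well under 40k parameters, which gives a sense of how a shallow network can still produce a dense depth map alongside a global pose estimate.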
“…The features have been used in 3D motion estimation approaches using visual odometry or SLAM formulations, for rotational motion only [31], known maps [40,14], and in combination with IMU sensors [42]. Other recent approaches jointly reconstruct the image intensity of the scene, and estimate 3D motion [17,18].…”
Segmentation of moving objects in dynamic scenes is a key process in scene understanding for both navigation and video recognition tasks. Without prior knowledge of the object structure and motion, the problem is very challenging due to the plethora of motion parameters to be estimated while remaining robust to motion blur and occlusions. Event sensors, with their high temporal resolution and lack of motion blur, seem well suited for addressing this problem. We propose a solution to multi-object motion segmentation that combines classical optimization methods with deep learning and does not require prior knowledge of the 3D motion or the number and structure of objects. Using the events within a time interval, the method estimates and compensates for the global rigid motion. It then segments the scene into multiple motions by iteratively fitting and merging models, using tracked feature regions as input and aligning them based on temporal gradients and contrast measures. The approach was successfully evaluated on challenging real-world and synthetic scenarios from the EV-IMO, EED and MOD datasets; it outperforms the state-of-the-art detection rate by as much as 12%, achieving new state-of-the-art average detection rates of 77.06%, 94.2% and 82.35% on these datasets, respectively.
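The contrast criterion used for fitting and merging motion models can be sketched as follows. This is a simplified illustration assuming a translational per-cluster motion model and a variance-based contrast normalized by event count; the actual method operates on tracked feature regions with temporal gradients, and the merge test here is a placeholder heuristic, not the paper's rule.

```python
# Illustrative sketch of a contrast criterion for fitting and merging
# per-cluster motion models. Events are rows of (x, y, t); the
# translational warp and the merge test are simplified placeholders.
import numpy as np

def iwe(events, model, shape=(180, 240)):
    """Image of warped events under a translational model (vx, vy)."""
    x = np.round(events[:, 0] - model[0] * events[:, 2]).astype(int)
    y = np.round(events[:, 1] - model[1] * events[:, 2]).astype(int)
    img = np.zeros(shape)
    ok = (x >= 0) & (x < shape[1]) & (y >= 0) & (y < shape[0])
    np.add.at(img, (y[ok], x[ok]), 1.0)
    return img

def contrast(events, model):
    """Variance of the IWE, normalized by event count so clusters of
    different sizes are roughly comparable (illustrative choice)."""
    return np.var(iwe(events, model)) / max(len(events), 1)

def should_merge(ev_a, ev_b, model_a, model_b):
    """Merge two clusters if one averaged model explains their union
    at least as well as the better individual fit (toy heuristic)."""
    joint = np.vstack([ev_a, ev_b])
    mean_model = 0.5 * (np.asarray(model_a) + np.asarray(model_b))
    return contrast(joint, mean_model) >= max(
        contrast(ev_a, model_a), contrast(ev_b, model_b))
```

The intuition is that a motion model which truly explains a set of events warps them into sharp, high-contrast edges, so a merge is accepted only when the union remains at least as sharp as the separate fits.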