2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2020
DOI: 10.1109/cvprw50498.2020.00510
Self-supervised Object Motion and Depth Estimation from Video

Abstract: We present a self-supervised learning framework to estimate individual object motion and monocular depth from video. We model object motion as a 6 degree-of-freedom rigid-body transformation. An instance segmentation mask is leveraged to introduce object-level information. Compared with methods that predict a dense optical flow map to model motion, our approach significantly reduces the number of values to be estimated. Our system eliminates the scale ambiguity of motion prediction through imposing…
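The abstract's 6 degree-of-freedom rigid-body parameterization (a rotation plus a translation per object) can be sketched as follows. This is a minimal NumPy illustration of applying such a transform to 3D points via the Rodrigues formula; the function name and axis-angle convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def se3_transform(points, rot_vec, trans):
    """Apply a 6-DoF rigid-body transform to an (N, 3) array of 3D points.

    rot_vec : axis-angle rotation (3,), trans : translation (3,).
    Rotation matrix is built with the Rodrigues formula.
    """
    theta = np.linalg.norm(rot_vec)
    if theta < 1e-8:
        R = np.eye(3)  # near-zero rotation: identity
    else:
        k = rot_vec / theta  # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])  # skew-symmetric cross-product matrix
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return points @ R.T + trans
```

Only six scalars (three for rotation, three for translation) describe each object's motion, versus two values per pixel for a dense optical flow map — which is the reduction in estimated values the abstract refers to.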

Cited by 35 publications
(14 citation statements)
References 29 publications
“…The core idea is to apply a differentiable warp and minimize photometric reprojection error. Recent methods improve performance by incorporating coupled training with optical flow [Ranjan et al. 2019; Yin and Shi 2018; Zou et al. 2018], object motion [Dai et al. 2019; Vijayanarasimhan et al. 2017], surface normals [Qi et al. 2018], edges, and visual odometry [Andraghetti et al. 2019; Shi et al. 2019; Wang et al. 2018b]. Other notable efforts include using stereo information [Guo et al. 2018; Watson et al. 2019], better network architecture and training-loss design [Gordon et al. 2019; Guizilini et al. 2019], a scale-consistent ego-motion network [Bian et al. 2019], incorporating 3D geometric constraints [Mahjourian et al. 2018], and learning from unknown camera intrinsics [Chen et al. 2019b; Gordon et al. 2019].…”
Section: Related Work
confidence: 99%
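The "differentiable warp + photometric reprojection error" idea this excerpt describes can be sketched in a few lines of NumPy. This is a simplified, hedged illustration, not the paper's implementation: it uses nearest-neighbour sampling and assumes a pinhole intrinsics matrix K, whereas real self-supervised depth pipelines use differentiable bilinear sampling so gradients flow to the depth and pose networks.

```python
import numpy as np

def photometric_reprojection_error(target, source, depth, K, R, t):
    """Warp `source` into the target view using per-pixel depth and a
    relative camera pose (R, t), then return the mean absolute photometric
    error over pixels that project inside the source image."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    # Homogeneous pixel coordinates, shape (N, 3)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(-1)   # back-project to 3D
    cam2 = R @ cam + t.reshape(3, 1)                       # move into source frame
    proj = K @ cam2                                        # project to source pixels
    us = np.round(proj[0] / proj[2]).astype(int)
    vs = np.round(proj[1] / proj[2]).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H) & (proj[2] > 0)
    warped = np.zeros(H * W)
    warped[valid] = source[vs[valid], us[valid]]           # nearest-neighbour sample
    return np.abs(warped - target.reshape(-1))[valid].mean()
```

With an identity pose and identical frames the warp maps each pixel to itself and the error is zero; training minimizes this error, which supervises depth and motion without ground-truth labels.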
“…Forecasting of non-semantic targets: the most common forecasting techniques operate on trajectories. They track and anticipate the future position of individual objects, either in 2D or 3D [15, 46, 16, 71]. For instance, Hsieh et al. [26] disentangle the position and pose of multiple moving objects, but only on synthetic data.…”
Section: Methods That Anticipate
confidence: 99%
“…However, DF-Net does not distinguish between dynamic and static regions when calculating optical-flow consistency constraints. For this reason, Casser et al. [29] and Dai et al. [32] both use a pretrained semantic segmentation network to obtain a mask of the dynamic regions. Although CC [33] also uses neural networks to estimate dynamic and static regions, CC does not use a pretrained network, but instead adds region segmentation to the self-supervised framework in a competitive and cooperative manner.…”
Section: Unsupervised Joint Learning of Depth, Optical Flow, and Ego-motion
confidence: 99%
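The mask-based separation of dynamic and static regions described in this excerpt can be sketched as a weighted loss over the photometric residuals. This is a hypothetical illustration (function name and weights are assumptions, not from any of the cited papers): the static background supervises ego-motion, while the masked object regions supervise per-object motion.

```python
import numpy as np

def masked_photometric_loss(residual, dyn_mask, w_static=1.0, w_dynamic=1.0):
    """Split per-pixel photometric residuals by a segmentation mask
    (1 = potentially moving object, 0 = static background) and combine
    the region means with illustrative weights."""
    dyn = dyn_mask.astype(bool)
    static_loss = residual[~dyn].mean() if (~dyn).any() else 0.0
    dynamic_loss = residual[dyn].mean() if dyn.any() else 0.0
    return w_static * static_loss + w_dynamic * dynamic_loss
```

Averaging each region separately keeps a large moving object from dominating the static-region signal that the ego-motion estimate depends on.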