2017
DOI: 10.5194/isprs-annals-iv-2-w3-67-2017

End-to-End Depth From Motion With Stabilized Monocular Videos

Abstract: We propose a depth map inference system for monocular videos based on a novel navigation dataset that mimics aerial footage from a gimbal-stabilized monocular camera in rigid scenes. Unlike most navigation datasets, the lack of rotation implies an easier structure-from-motion problem, which can be leveraged for different kinds of tasks such as depth inference and obstacle avoidance. We also propose an architecture for end-to-end depth inference with a fully convolutional network. Results show that a…
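
As a rough sketch of why the absence of rotation makes the structure-from-motion problem easier (a standard instantaneous-motion argument, not an equation taken from the paper): for a pinhole camera of focal length f translating by t = (t_x, t_y, t_z) in a rigid scene, with no rotation, the optical flow at a pixel (x, y) measured from the principal point is

    (u, v) = (1/Z) * (x*t_z - f*t_x, y*t_z - f*t_y)

so the flow depends on the scene only through the inverse depth 1/Z. With a known displacement magnitude, depth can be recovered pointwise from the apparent motion, and the depth-independent rotational flow terms that usually have to be estimated and compensated never appear.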

Cited by 6 publications (8 citation statements); references 37 publications (40 reference statements).

“…In practice, these labels are expensive to obtain and, thus, limit the data quantity and thereby the application of deep learning methods. To cope with the given data limitations, one possibility is to generate artificial datasets [10,19], but the transfer from synthetic datasets to reality is still accompanied by a significant decrease in performance.…”
Section: Related Work
confidence: 99%
“…Their central aim was, similar to ours, to establish a depth network that can incorporate structure from motion in its prediction, instead of relying only on structure from scene geometry as the single-frame approaches do. In practice, [9] beat the basic SfML framework only on the artificial StillBox dataset [10], which shows random shapes and textures in a 3D space. Despite a superior performance on StillBox, the results on the autonomous driving benchmark KITTI [24] were similar to, and in part worse than, the baseline architecture.…”
Section: Related Work
confidence: 99%
“…Our network, called DepthNet, is broadly inspired by FlowNetS [3] (initially used for flow inference). It is described in detail in [18]; we provide here a summary of its structure (Fig 3) and performance. Each convolution (apart from the depth modules) is followed by a Spatial Batch Normalization and a ReLU activation layer.…”
Section: Depth Inference Training
confidence: 99%
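
As a minimal illustration of the repeated unit described in the quotation above (a sketch in PyTorch, not the authors' actual DepthNet code; the channel sizes, strides, and 6-channel stacked frame-pair input are assumptions):

    import torch
    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch, stride=1):
        # One unit as quoted: convolution, then spatial batch
        # normalization, then ReLU activation.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    # Hypothetical FlowNetS-like encoder front end: the two RGB frames are
    # stacked into a 6-channel input and progressively downsampled.
    encoder = nn.Sequential(
        conv_bn_relu(6, 32, stride=2),
        conv_bn_relu(32, 64, stride=2),
        conv_bn_relu(64, 128, stride=2),
    )
    features = encoder(torch.randn(1, 6, 128, 256))  # toy frame pair

The "depth modules" mentioned in the quotation would sit in a FlowNetS-style decoder producing multi-scale depth outputs; they are omitted from this sketch.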
“…No preprocessing, such as optical flow computation or visual odometry, is applied to the input, while the depth is directly provided as an output [18]. We created a dataset of image pairs with random translation movements, with no rotation, and a constant displacement magnitude applied during the whole training.…”
Section: Introduction
confidence: 99%
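
A minimal sketch of the kind of camera motion the quotation describes (an assumed sampling scheme; the displacement magnitude value is a placeholder, not a figure from the paper):

    import numpy as np

    def sample_pair_motion(magnitude=0.3, rng=np.random.default_rng(0)):
        # Random translation direction rescaled to a constant displacement
        # magnitude; rotation is kept at identity, as with a
        # gimbal-stabilized camera.
        direction = rng.normal(size=3)
        direction /= np.linalg.norm(direction)
        rotation = np.eye(3)
        translation = magnitude * direction
        return rotation, translation

    R, t = sample_pair_motion()  # pose change between the two frames of a pair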