2019
DOI: 10.1126/scirobotics.aaw6661

Does computer vision matter for action?

Abstract: Controlled experiments indicate that explicit intermediate representations help action. Biological vision systems evolved to support action in physical environments [1, 2]. Action is also a driving inspiration for computer vision research. Problems in computer vision are often motivated by their relevance to robotics and their prospective utility for systems that move and act in the physical world. In contrast, a recent stream of research at the intersection of machine…
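The abstract's central claim can be pictured as a choice of pipeline: a policy acting directly on raw pixels versus one that first predicts an explicit intermediate representation (such as depth) and acts on that. The sketch below is an illustrative assumption, not the paper's actual experimental setup; the linear weights and the toy "depth estimator" are placeholders for learned networks.

```python
import numpy as np

def raw_pixel_policy(rgb: np.ndarray) -> float:
    """End-to-end baseline: map a flattened RGB frame directly to a scalar action."""
    w = np.full(rgb.size, 1.0 / rgb.size)  # placeholder linear weights
    return float(np.tanh(w @ rgb.ravel()))

def depth_mediated_policy(rgb: np.ndarray, depth_fn) -> float:
    """Mediated variant: predict depth first, then act on the depth map."""
    depth = depth_fn(rgb)                  # explicit intermediate representation
    w = np.full(depth.size, 1.0 / depth.size)
    return float(np.tanh(w @ depth.ravel()))

# Toy stand-in for a learned monocular depth network: inverse of brightness.
fake_depth = lambda rgb: 1.0 / (1.0 + rgb.mean(axis=-1))

frame = np.random.default_rng(0).random((4, 4, 3))  # a synthetic 4x4 RGB frame
a_raw = raw_pixel_policy(frame)
a_mid = depth_mediated_policy(frame, fake_depth)
```

The paper's controlled experiments compare agents of exactly this shape, holding the task fixed and varying only whether the intermediate representation is exposed.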



Cited by 91 publications (61 citation statements)
References 7 publications
“…Depth is among the most useful intermediate representations for action in physical environments [1]. Despite its utility, monocular depth estimation remains a challenging problem that is heavily underconstrained.…”
Section: Introduction
confidence: 99%
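The excerpt's point that monocular depth is "heavily underconstrained" has a classic geometric illustration: under a pinhole camera, a scene uniformly scaled by any factor k projects to the same image, so a single view cannot pin down absolute depth without extra priors. The numeric check below is an assumption-level sketch, not drawn from the cited paper.

```python
import numpy as np

def project(point_3d, focal=1.0):
    """Pinhole projection of a 3-D point (X, Y, Z) to image coordinates (x, y)."""
    X, Y, Z = point_3d
    return (focal * X / Z, focal * Y / Z)

p = np.array([0.5, 0.2, 2.0])  # a point 2 m in front of the camera
k = 3.0                        # scale the entire scene by 3x
# project(p) and project(k * p) are identical: depth and scale trade off exactly.
same_image = np.allclose(project(p), project(k * p))
```

This scale ambiguity is why monocular methods must lean on learned priors (object sizes, texture gradients) rather than geometry alone.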
“…Oxford Robotcar [27] was the first real-world large-scale dataset in which adverse visual conditions such as nighttime, rain and snow were significantly represented, but it did not feature semantic annotations. While more recent large-scale sets [2,30] that cover adverse conditions, such as Waymo Open [42] and nuScenes [3], include bounding boxes, they still lack dense pixel-level semantic annotations, which are vital for real-world autonomous agents [63]. BDD100K [55] is the only exception to this rule, with ca.…”
Section: Related Work
confidence: 99%
“…The field of unsupervised learning has explored different ways to learn state representations [27,28,29] for policy learning. Recently, several works [30,31,32] have studied the benefits of combining various mid-level visual representations for reinforcement learning. Different from previous works, we 1) demonstrate real-world results on manipulation while previous works present simulation results on navigation, 2) highlight the importance of using a pre-trained model for exploration, and 3) propose an initialization strategy without the need of data collection at target domain.…”
Section: Related Work
confidence: 99%