MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching

Shamsafar, Faranak; Woerz, Samuel; Rahim, Rafia; Zell, Andreas

doi:10.48550/arxiv.2108.09770

Cited by 3 publications

(12 citation statements)

References 28 publications

(72 reference statements)

“…Xing et al [8] has proposed adjust multi-branch module which combines depth-wise convolution to reduce the number of channels. The mobilestereonet's [9] main contribution is that they have proposed to reform the cost volume with convolution and use the 2D convolution only for the disparity regression. However, the final FLOPs and latency are still too larger and far from real-time.…”

Section: Related Work (mentioning)

confidence: 99%

“…With the goal of achieving real-time processing on edge devices (GPU/NPU), many lightweight methods have been proposed [7]-[12]. Those methods can be roughly divided into two categories: multi-stage method [8], [10], [13] and model compression method [7], [9], [11]. The computational complexity of the network depends on two factors: the size of the feature map and the number of convolution kernels.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, replacing the 3D convolution is challenging. Shamsafar et al [9] has achieved promising result by compressing the channel. More importantly, only 2D convolution is used in their network.…”

Section: Introduction (mentioning)

confidence: 99%

Attention-Aware Feature Aggregation for Real-Time Stereo Matching on Edge Devices

Chang

Chen

2021

Lecture Notes in Computer Science

In recent years, numerous real-time stereo matching methods have been introduced, but they often lack accuracy. These methods attempt to improve accuracy by introducing new modules or integrating traditional methods. However, the improvements are only modest. In this paper, we propose a novel strategy by incorporating knowledge distillation and model pruning to overcome the inherent trade-off between speed and accuracy. As a result, we obtained a model that maintains real-time performance while delivering high accuracy on edge devices. Our proposed method involves three key steps. Firstly, we review state-of-the-art methods and design our lightweight model by removing redundant modules from those efficient models through a comparison of their contributions. Next, we leverage the efficient model as the teacher to distill knowledge into the lightweight model. Finally, we systematically prune the lightweight model to obtain the final model. Through extensive experiments conducted on two widely-used benchmarks, Sceneflow and KITTI, we perform ablation studies to analyze the effectiveness of each module and present our state-of-the-art results.

“…With the rise of deep learning, stereo matching continued to be reformed by these modern techniques. Following the general paradigm for stereo reconstruction, deep models can be divided into two categories: the methods that formulate one or some of the steps with a deep learning framework [2,28,37], and the approaches that transfer the full process in an end-to-end scheme [4,12,19,21,29,39]. Following the recent research, our model is also an end-to-end one, processing a 3-tuple sample.…”

Section: Related Work (mentioning)

confidence: 99%

TriStereoNet: A Trinocular Framework for Multi-baseline Disparity Estimation

Shamsafar

Zell

2021

Preprint

Self Cite

Stereo vision is an effective technique for depth estimation with broad applicability in autonomous urban and highway driving. While various deep learning-based approaches have been developed for stereo, the input data from a binocular setup with a fixed baseline are limited. Addressing such a problem, we present an end-to-end network for processing the data from a trinocular setup, which is a combination of a narrow and a wide stereo pair. In this design, two pairs of binocular data with a common reference image are treated with shared weights of the network and a mid-level fusion. We also propose a Guided Addition method for merging the 4D data of the two baselines. Additionally, an iterative sequential self-supervised and supervised learning on real and synthetic datasets is presented, making the training of the trinocular system practical with no need to ground-truth data of the real dataset. Experimental results demonstrate that the trinocular disparity network surpasses the scenario where individual pairs are fed into a similar architecture. Code and dataset: https://github.com/cogsystuebingen/tristereonet.

“…The cell-phone of the 90s was a phone, the modern cellphone is a handheld computational imaging platform [9] that is capable of acquiring high-quality images, pose, and depth. Recent years have witnessed explosive advances in passive depth imaging, from single-image methods that leverage large data priors to predict structure directly from image features [39,40] to efficient multi-view approaches grounded in principles of 3D geometry and epipolar projection [49,46]. Alongside, progress has been made in the miniaturization and cost-reduction [3] of active depth systems such as LiDAR and correlation time-of-flight sensors [29].…”

Section: Introduction (mentioning)

confidence: 99%

The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement

Chugunov

Zhang

Xia

et al. 2021

Preprint

Modern smartphones can continuously stream multi-megapixel RGB images at 60 Hz, synchronized with high-quality 3D pose information and low-resolution LiDAR-driven depth estimates. During a snapshot photograph, the natural unsteadiness of the photographer's hands offers millimeter-scale variation in camera pose, which we can capture along with RGB and depth in a circular buffer. In this work we explore how, from a bundle of these measurements acquired during viewfinding, we can combine dense micro-baseline parallax cues with kilopixel LiDAR depth to distill a high-fidelity depth map. We take a test-time optimization approach and train a coordinate MLP to output photometrically and geometrically consistent depth estimates at the continuous coordinates along the path traced by the photographer's natural hand shake. The proposed method brings high-resolution depth estimates to "point-and-shoot" tabletop photography and requires no additional hardware, artificial hand motion, or user interaction beyond the press of a button.
