2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.176
Learned Multi-patch Similarity

Abstract: Estimating a depth map from multiple views of a scene is a fundamental task in computer vision. As soon as more than two viewpoints are available, one faces the very basic question of how to measure similarity across >2 image patches. Surprisingly, no direct solution exists; instead it is common to fall back to more or less robust averaging of two-view similarities. Encouraged by the success of machine learning, and in particular convolutional neural networks, we propose to learn a matching function which directl…

Cited by 107 publications (93 citation statements)
References 26 publications (43 reference statements)
“…Here, we have tested straight-forward, handcrafted averaging and voting schemes. It may however be interesting to also learn the combination, or even to explore an "early combination" where a multi-way similarity [20] is computed from a test example to a set of multiple exemplars.…”
Section: Results
confidence: 99%
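The fallback the excerpt alludes to, robust averaging of two-view similarities, can be made concrete with a toy numpy sketch. ZNCC as the pairwise score and the patch sizes are assumptions for illustration, not details from the cited work:

```python
import numpy as np

def zncc(a, b, eps=1e-8):
    # Zero-mean normalized cross-correlation between two patches
    # (equals the Pearson correlation of their pixel values).
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

def averaged_two_view_similarity(ref, others):
    # The "more or less robust averaging" baseline: score the
    # reference patch against each other view, then average.
    return float(np.mean([zncc(ref, p) for p in others]))

rng = np.random.default_rng(0)
ref = rng.standard_normal((11, 11))
# Simulated matching views: the reference patch plus mild noise.
views = [ref + 0.1 * rng.standard_normal((11, 11)) for _ in range(3)]
score = averaged_two_view_similarity(ref, views)
```

A learned multi-way similarity would instead consume all patches jointly rather than reducing them to pairwise scores first.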
“…After the advent of modern convolutional networks, the same idea was applied to raw images, e.g., [12,13,14,15,16]. Siamese convolutional branches independently transform two (or more) images A and B into high-level representations that are then merged and transformed further into a learned measure F (A, B) of similarity.…”
Section: Related Work
confidence: 99%
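The Siamese pattern the excerpt describes — shared-weight branches producing embeddings that a merge layer turns into a score F(A, B) — can be sketched with untrained random weights. All sizes and the tanh/linear layers are assumptions, not the architecture of any cited paper:

```python
import numpy as np

rng = rng = np.random.default_rng(1)
W = rng.standard_normal((8, 25)) * 0.1   # branch weights (hypothetical sizes)
V = rng.standard_normal(16) * 0.1        # weights of the merge/decision layer

def branch(patch):
    # Shared ("Siamese") branch: the SAME weights W transform every input,
    # so both images are embedded into the same representation space.
    return np.tanh(W @ patch.ravel())

def similarity(A, B):
    # F(A, B): concatenate the two branch embeddings and score them
    # with a learned (here: random) linear layer.
    merged = np.concatenate([branch(A), branch(B)])
    return float(V @ merged)

A = rng.standard_normal((5, 5))
s_same = similarity(A, A)
s_diff = similarity(A, rng.standard_normal((5, 5)))
```

Training would fit W and V so that F is high for corresponding patches and low otherwise; the point here is only the weight sharing between branches.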
“…The simplest representations for 3D reconstruction from one or more images are 2.5D depth maps, as they can be inferred using standard 2D convolutional neural networks [14,18,24,43]. Since depth maps are view-based, these methods require additional post-processing algorithms to fuse information from multiple viewpoints in order to capture the entire object geometry.…”
Section: 3D Reconstruction
confidence: 99%
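The first step of the fusion post-processing mentioned above is lifting each view-based depth map into 3D. A minimal sketch using the standard pinhole back-projection X = Z · K⁻¹ [u, v, 1]ᵀ; the intrinsics and the constant-depth map are toy values, not data from any cited work:

```python
import numpy as np

def backproject(depth, K):
    # Lift a 2.5D depth map to 3D points in the camera frame:
    # each pixel (u, v) with depth Z maps to Z * K^{-1} [u, v, 1]^T.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix          # (3, h*w) viewing rays
    return (rays * depth.reshape(-1)).T    # scale each ray by its depth

K = np.array([[50.0, 0.0, 16.0],           # toy pinhole intrinsics (assumed)
              [0.0, 50.0, 12.0],
              [0.0,  0.0,  1.0]])
depth = np.full((24, 32), 2.0)             # fronto-parallel plane at Z = 2
pts = backproject(depth, K)
```

Fusing several views then amounts to transforming each such point cloud into a common world frame with the camera poses and merging them, which is exactly the extra machinery that voxel or implicit representations avoid.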
“…This allows us to state the following: if the first primitive exists, it will be the one closest to the target point x_i; if the first primitive does not exist and the second does, then the second primitive is closest to x_i, and so forth. More formally, this property can be stated as follows. A 3D vector r(η, ω) defines a closed surface in space as η (the latitude angle) and ω (the longitude angle) vary over the given intervals (Eq. 14). The rigid-body transformation T_m(x) maps a point from the world coordinate system to the local coordinate system of the m-th primitive.…”
Section: B. Derivation of Pointcloud-to-Primitive Loss
confidence: 99%
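For the simplest closed surface of this kind, a sphere, the parameterization r(η, ω) by latitude and longitude can be written out directly. This is an illustrative special case only; the cited work uses a more general primitive family (superquadrics raise the trigonometric terms to shape-dependent exponents):

```python
import numpy as np

def sphere_surface(eta, omega, R=1.0):
    # r(eta, omega) for a sphere of radius R: eta is the latitude angle,
    # omega the longitude angle; sweeping both traces the closed surface.
    return np.array([R * np.cos(eta) * np.cos(omega),
                     R * np.cos(eta) * np.sin(omega),
                     R * np.sin(eta)])

etas = np.linspace(-np.pi / 2, np.pi / 2, 7)     # latitude samples
omegas = np.linspace(-np.pi, np.pi, 13)          # longitude samples
pts = np.array([sphere_surface(e, o) for e in etas for o in omegas])
```

Every sampled point lies at distance R from the origin, confirming that the (η, ω) sweep stays on the closed surface.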