Learning Multi-Scene Absolute Pose Regression with Transformers

Shavit, Yoli; Ferens, Ron; Keller, Yosi

doi:10.48550/arxiv.2103.11468

Cited by 1 publication

(3 citation statements)

References 28 publications


“…Our approach performs similar to existing APR and RPR techniques that also use only a single forward pass in a network [1,8,30,60], but worse than iterative approaches such as [19] or methods that use more densely spaced synthetic views as additional input [41]. Note that these approaches that do not use 3D scene geometry are less accurate than state-of-the-art methods based on 2D-3D correspondences [7,56,58].…”

Section: Methods (mentioning)

confidence: 80%

“…Since this task can be considered an "inverse" of the novel view synthesis task [70], we consider the ability to perform both tasks via the same model to be an intriguing property. Even though the localization results are not yet competitive with state-of-the-art localization pipelines, we achieve a similar level of pose accuracy as comparable methods such as [1,60].…”

Section: Introduction (mentioning)

confidence: 82%

“…Pose regression methods train a convolutional neural network (CNN) to regress the camera pose of an input image. There are two categories: absolute pose regression (APR) methods [5,8,14,28,30,33,41,60] and relative pose regression (RPR) methods [1,19,31,33,39]. It was shown [59] that APR is often not (much) more accurate than IR.…”

Section: Visual Localization (mentioning)

confidence: 99%


ViewFormer: NeRF-Free Neural Rendering from Few Images Using Transformers

Kulhánek, Derner, Sattler, et al. 2022

Lecture Notes in Computer Science


Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Field (NeRF), and while achieving impressive results, the methods suffer from long training times as they require evaluating millions of 3D point samples via a neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive compared to NeRF-based methods while not reasoning explicitly in 3D, and it is faster to train.
