Learning Multi-Scene Absolute Pose Regression with Transformers

Shavit, Yoli; Ferens, Ron; Keller, Yosi

doi:10.1109/iccv48922.2021.00273

Cited by 76 publications

(34 citation statements)

References 43 publications

“…conclude that the proposed E-PoseNet achieves the lowest location error across all the outdoor and indoor scenes, and the lowest orientation error across the majority of them. It also competes with most recent transformer-based architectures [15,16] on these datasets.…”

Section: Datasets (mentioning)

confidence: 89%

Leveraging Equivariant Features for Absolute Pose Regression

Musallam, Gaudillière, Castillo et al. 2022

Preprint

While end-to-end approaches have achieved state-of-the-art performance in many perception tasks, they are not yet able to compete with 3D geometry-based methods in pose estimation. Moreover, absolute pose regression has been shown to be more related to image retrieval. As a result, we hypothesize that the statistical features learned by classical Convolutional Neural Networks do not carry enough geometric information to reliably solve this inherently geometric task. In this paper, we demonstrate how a translation and rotation equivariant Convolutional Neural Network directly induces representations of camera motions into the feature space. We then show that this geometric property allows for implicitly augmenting the training data under a whole group of image plane-preserving transformations. Therefore, we argue that directly learning equivariant features is preferable to learning data-intensive intermediate representations. Comprehensive experimental validation demonstrates that our lightweight model outperforms existing ones on standard datasets.

“…PoseNet has been further improved by combining CNNs and LSTMs for feature correlation [43], introducing temporal information [6], incorporating spatial constraints [1] or by adding additional covisibility constraints based on local maps and the estimated odometry [49]. MS-Transformer [36] is a recent relocalization work based on transformer architecture, achieving the state-of-the-art results.…”

Section: A Learning-based Pose Estimation (mentioning)

confidence: 99%

“…We follow the official data split to train and test our models above this dataset. Task Baselines: Our SelectFusion model is built as an end-to-end relocalization model, and thus we compare with LSTM-Pose [43], VidLoc [6], and MS-Transformer [36] which are representative within this category of learning techniques.…”

Section: A Experimental Setups (mentioning)

confidence: 99%

Learning Selective Sensor Fusion for State Estimation

Chen, Rosa et al. 2024

IEEE Trans. Neural Netw. Learning Syst.

“…An alternative to explicitly representing the 3D scene geometry via a 3D model is to implicitly store information about the scene in the weights of a machine learning model. Examples include scene coordinate regression techniques [9, 10, 12, 14-16, 75, 90], which regress 2D-3D matches rather than computing them via explicit descriptor matching, and absolute [34,35,48,74,93] and relative pose [3,23,37] regressors. Scene coordinate regressors achieve state-of-the-art results for small scenes [8], but have not yet shown strong performance in more challenging scenes.…”

Section: Introduction (mentioning)

confidence: 99%

MeshLoc: Mesh-Based Visual Localization

Panek, Kúkelová, Sattler 2022

Preprint

Visual localization, i.e., the problem of camera pose estimation, is a central component of applications such as autonomous robots and augmented reality systems. A dominant approach in the literature, shown to scale to large scenes and to handle complex illumination and seasonal changes, is based on local features extracted from images. The scene representation is a sparse Structure-from-Motion point cloud that is tied to a specific local feature. Switching to another feature type requires an expensive feature matching step between the database images used to construct the point cloud. In this work, we thus explore a more flexible alternative based on dense 3D meshes that does not require feature matching between database images to build the scene representation. We show that this approach can achieve state-of-the-art results. We further show that surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage, and even when rendering raw scene geometry without color or texture. Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.
