2021
DOI: 10.48550/arxiv.2103.11477
Preprint

Paying Attention to Activation Maps in Camera Pose Regression

Abstract: The proposed attention-based regression localization scheme. The input image is first encoded by a convolutional backbone. Two activation maps, at different resolutions, are transformed into sequential representations. The two activation sequences are analyzed by dual Transformer encoders, one per regression task. We depict the attention weights via heatmaps. Position is best estimated by corner-like image features, while orientation is estimated by edge-like features. Each Transformer encoder output is used t…
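
As a rough illustration of the step the abstract describes as transforming activation maps "into sequential representations", the sketch below flattens a C x H x W activation map into a sequence of H*W tokens, each of dimension C. This is only one plausible reading: the function name `activation_map_to_sequence`, the row-major ordering, the nested-list layout, and the toy map sizes are all assumptions made for illustration, and any projection or positional encoding the authors may apply is not stated in the excerpt and is omitted here.

```python
from typing import List

# Assumed layout for illustration: a map is indexed as [channel][row][col].
Map3D = List[List[List[float]]]


def activation_map_to_sequence(act: Map3D) -> List[List[float]]:
    """Flatten a C x H x W map into H*W tokens of dimension C (row-major)."""
    channels = len(act)
    height = len(act[0])
    width = len(act[0][0])
    return [
        [act[c][y][x] for c in range(channels)]  # one token per spatial cell
        for y in range(height)
        for x in range(width)
    ]


def toy_map(channels: int, size: int) -> Map3D:
    """Build a hypothetical map whose values encode their own (c, y, x) index."""
    return [
        [[float(c * 100 + y * 10 + x) for x in range(size)] for y in range(size)]
        for c in range(channels)
    ]


# Two hypothetical activation maps at different resolutions.
coarse = toy_map(channels=3, size=2)
fine = toy_map(channels=3, size=4)

coarse_seq = activation_map_to_sequence(coarse)  # 4 tokens of dimension 3
fine_seq = activation_map_to_sequence(fine)      # 16 tokens of dimension 3
```

Under these assumptions, each resulting sequence could then be passed to its own Transformer encoder, one per regression task, as the abstract outlines.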

Cited by 1 publication
(1 citation statement)
References 29 publications
“…Regression-based localization trains a neural network and takes the network parameters as a global map representation which can directly regress the 6-DOF camera poses (Kendall, Grimes, and Cipolla 2015; Balntas, Li, and Prisacariu 2018; Kendall and Cipolla 2017; Moreau et al 2022a; Shavit, Ferens, and Keller 2021) or the 3D scene coordinate of each pixel (Cavallari et al 2017; Li et al 2020; Yang et al 2019) by taking the query image as the network input. For the simplicity and end-to-end training manner, these methods have attracted considerable attention.…”
Section: Introduction
Citation type: mentioning
confidence: 99%