Do We Really Need Scene-specific Pose Encoders?

Shavit, Yoli; Ferens, Ron

doi:10.1109/icpr48806.2021.9412225

Cited by 16 publications

(14 citation statements)

References 33 publications


“…To the best of our knowledge, the paradigm of regressing the camera pose from the final output of a CNN backbone was adopted by all regressors to date [30]. Variations to the architecture focused on alternatives to the original proposed CNN backbone [20,21,38,31] and on deeper, branching architectures for the MLP head [38,21]. Other works tried to address overfitting by averaging over predictions from models with randomly dropped activations [14] or by reducing the dimensionality of the global image encoding with Long-Short-Term-Memory (LSTM) layers [36].…”

Section: Image-based Camera Pose Estimation (mentioning)

confidence: 99%

“…This formulation was adopted by many pose regressors, however it still requires manually tuning the parameters' initialization for different datasets [34]. In a recent work [31], the authors trained the model separately for position and orientation in order to reduce the need of additional parameters, while achieving comparable accuracy. Alternative representations for the orientation were also proposed to gain better balance and stability of the pose loss [38,5].…”

Section: Image-based Camera Pose Estimation (mentioning)

confidence: 99%

“…where s_x and s_q are learned parameters. Recently, Shavit and Ferens showed the advantage of separately learning each task on its own [31]. Here, we follow a combined approach, where we first train the entire model to minimize Eq.…”

Section: Camera Pose Loss (mentioning)

confidence: 99%
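
The learned parameters s_x and s_q in the statement above weight the position and orientation terms of a pose loss. The equation itself is truncated in the excerpt, so the following is only a minimal sketch that assumes the commonly used learned-weighting form, L = L_x * exp(-s_x) + s_x + L_q * exp(-s_q) + s_q, with Euclidean distances for both terms. The function names and the choice to normalize both quaternions are illustrative assumptions, not taken from the cited paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def normalize(q):
    """Scale a quaternion (or any vector) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def learned_weight_pose_loss(x_pred, x_true, q_pred, q_true, s_x, s_q):
    """Pose loss with learned balancing parameters s_x and s_q.

    Assumed form (not confirmed by the excerpt):
        L = L_x * exp(-s_x) + s_x + L_q * exp(-s_q) + s_q
    In training, s_x and s_q would be optimized with the network weights;
    here they are plain floats so the arithmetic can be checked by hand.
    """
    loss_x = l2_distance(x_pred, x_true)                        # position error
    loss_q = l2_distance(normalize(q_pred), normalize(q_true))  # orientation error
    return loss_x * math.exp(-s_x) + s_x + loss_q * math.exp(-s_q) + s_q
```

With s_x = s_q = 0 the expression reduces to the plain sum of the two errors. Training position and orientation separately, as the statement attributes to [31], would correspond to optimizing only one of the two terms at a time.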

“…The appealing 5 ms runtime and simplicity (a single component instead of a heavy pipeline) paved the way to a new research paradigm for camera pose estimation. Numerous methods, soon to follow, aimed at maintaining the low runtime and memory requirements, while improving the accuracy and generalization of the original method [16,14,15,20,21,36,38,31,37,6].…”

Section: Introduction (mentioning)

confidence: 99%

“…Different camera pose regressors suggested different backbones [20,31], loss formulations [15] and MLP architectures [38,21] as well as additional manipulations of the output [36,37]. Common to all these methods, is that they apply the pose regression using a single global image encoding computed by the backbone CNN.…”

Section: Introduction (mentioning)

confidence: 99%


Paying Attention to Activation Maps in Camera Pose Regression

Shavit, Ferens, Keller

2021

Preprint

Self Cite


The proposed attention-based regression localization scheme. The input image is first encoded by a convolutional backbone. Two activation maps, at different resolutions, are transformed into sequential representations. The two activation sequences are analyzed by dual Transformer encoders, one per regression task. We depict the attention weights via heatmaps. Position is best estimated by corner-like image features, while orientation is estimated by edge-like features. Each Transformer encoder output is used to regress the respective camera pose component (position x or orientation q).
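
The step in which an activation map is "transformed into a sequential representation" can be illustrated with a small sketch. It assumes the simplest reading: a map of shape C x H x W is flattened row by row into H*W tokens, each a C-dimensional vector. The function name is hypothetical, and any projection or positional encoding the actual model may apply is left out.

```python
def activation_map_to_sequence(feature_map):
    """Flatten a C x H x W activation map into a list of H*W tokens.

    feature_map is a nested list indexed as [channel][row][col].
    Each returned token holds the C channel values at one spatial
    location, visited in row-major order.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][i][j] for c in range(channels)]
        for i in range(height)
        for j in range(width)
    ]


# A 2-channel, 2 x 2 map becomes a sequence of 4 tokens of length 2.
tokens = activation_map_to_sequence([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

A sequence of this kind is the input format a Transformer encoder expects; in the scheme described above, one such sequence would go to the position encoder and another, from a different resolution, to the orientation encoder.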
