Leveraging Image Matching Toward End-to-End Relative Camera Pose Regression

Fadi Khatib, Yuval Margalit, Meirav Galun, Ronen Basri

2024-04-16

Abstract: This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our method predicts the relative rotation and translation (including direction and scale) between the two respective cameras. Inspired by the classical pipeline, our method leverages Image Matching (IM) as a pre-training task for relative pose regression. Specifically, we use LoFTR, an architecture that utilizes an attention-based network pre-trained on ScanNet, to extract semi-dense feature maps, which are then warped and fed into a pose regression network. Notably, we use a loss function that utilizes separate terms to account for the translation direction and scale. We believe such a separation is important because translation direction is determined by point correspondences while the scale is inferred from priors on shape sizes. Our ablations further support this choice. We evaluate our method on several datasets and show that it outperforms previous end-to-end methods. The method also generalizes well to unseen datasets.

Computer Vision and Pattern Recognition
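The abstract motivates a loss with separate terms for translation direction and scale. The following is a minimal Python/NumPy sketch of such a loss for a single image pair, not the authors' implementation. It assumes the direction term is 1 minus the cosine similarity of the unit translation vectors and the scale term is the L1 distance between translation norms, as stated in the contributions list of this summary; the geodesic rotation term, the weights `w_rot`, `w_dir`, `w_scale`, and all function names are hypothetical.

```python
import numpy as np


def rotation_angle_error(r_pred: np.ndarray, r_gt: np.ndarray) -> float:
    """Geodesic distance (radians) between two 3x3 rotation matrices (assumed rotation term)."""
    cos_angle = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # clip guards against round-off


def split_translation(t: np.ndarray, eps: float = 1e-8):
    """Split a translation vector into a unit direction and a scalar scale (its norm)."""
    scale = float(np.linalg.norm(t))
    return t / max(scale, eps), scale


def relative_pose_loss(r_pred, t_pred, r_gt, t_gt,
                       w_rot: float = 1.0, w_dir: float = 1.0, w_scale: float = 1.0) -> float:
    """Hypothetical loss with separate rotation, translation-direction and scale terms."""
    dir_pred, scale_pred = split_translation(t_pred)
    dir_gt, scale_gt = split_translation(t_gt)
    dir_term = 1.0 - float(np.dot(dir_pred, dir_gt))  # 1 - cosine similarity of unit directions
    scale_term = abs(scale_pred - scale_gt)             # L1 error on translation magnitude
    rot_term = rotation_angle_error(r_pred, r_gt)
    return w_rot * rot_term + w_dir * dir_term + w_scale * scale_term


identity = np.eye(3)
t_gt = np.array([0.0, 0.0, 1.0])
print(relative_pose_loss(identity, np.array([0.0, 0.0, 2.0]), identity, t_gt))  # 1.0
```

In the example, the predicted translation points in the correct direction but is twice as long, so the direction term is 0 and only the scale term (1.0) contributes. A prediction that is right about direction but wrong about distance is penalized only through the scale term, which is the separation the abstract argues for.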

What problem does this paper attempt to address?

The paper addresses relative camera pose estimation in computer vision, that is, recovering the relative orientation and position of two cameras. Specifically, it proposes an end-to-end deep learning method that, given two images of the same scene taken from different viewpoints, predicts the relative rotation and translation (including direction and scale) between the two cameras. The method uses image matching as a pre-training task to improve relative pose regression. The main contributions are as follows:

1. Using image matching (IM) as a pre-training task for relative pose regression, yielding a new end-to-end relative pose estimation framework.
2. A new loss function that separates the direction and scale of the translation vector, training them with a cosine-similarity term and an L1 term, respectively.
3. Hard matching and warping instead of soft matching and warping, with experiments demonstrating the advantage of this choice.
4. Validation of the feature representations produced by the pre-trained IM backbone, highlighting the role of its interleaved self-attention and cross-attention modules in capturing feature similarity between the two images.
5. Evaluation on multiple datasets, including settings where the training and test datasets differ; the method outperforms other end-to-end relative pose regression networks in almost all experiments.
6. A substantially narrower performance gap between relative pose regression and feature-matching methods, while retaining faster inference.
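Contribution 3 contrasts hard and soft matching and warping. This sketch illustrates the difference on flattened feature maps (one row per location, one column per channel): soft warping gives each location in image A a softmax-weighted average of image B's features, while hard warping copies the single best-matching feature. It is a hypothetical illustration in Python/NumPy using plain cosine similarity with an assumed temperature of 0.1; it is not LoFTR's matching procedure or the paper's warping code, and the names `similarity_matrix`, `soft_warp`, and `hard_warp` are illustrative.

```python
import numpy as np


def similarity_matrix(feat_a: np.ndarray, feat_b: np.ndarray,
                      temperature: float = 0.1) -> np.ndarray:
    """Cosine similarity between every location of A and of B, divided by a temperature."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return (a @ b.T) / temperature  # shape (N_a, N_b)


def soft_warp(feat_a: np.ndarray, feat_b: np.ndarray,
              temperature: float = 0.1) -> np.ndarray:
    """Each location of A receives a softmax-weighted average of B's features."""
    sim = similarity_matrix(feat_a, feat_b, temperature)
    sim = sim - sim.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feat_b  # shape (N_a, C)


def hard_warp(feat_a: np.ndarray, feat_b: np.ndarray,
              temperature: float = 0.1):
    """Each location of A receives the single best-matching feature of B."""
    sim = similarity_matrix(feat_a, feat_b, temperature)
    best = sim.argmax(axis=1)  # index of the best match in B for every location in A
    return feat_b[best], best


# Toy example: 4 locations with one-hot 4-channel features; B is a permutation of A.
feat_a = np.eye(4)
feat_b = feat_a[[2, 0, 3, 1]]
hard, matches = hard_warp(feat_a, feat_b)
soft = soft_warp(feat_a, feat_b)
print(matches)                    # [1 3 0 2]
print(np.allclose(hard, feat_a))  # True
print(np.allclose(soft, feat_a))  # False (default tolerances)
```

Because image B's features are a permutation of image A's, `hard_warp` recovers A's features exactly, whereas `soft_warp` returns a slightly blurred mix of B's features. In a pipeline like the one summarized here, the warped map would then be fed, together with image A's features, to the pose regression network; how the paper combines them is not specified in this summary.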

Insights on Evaluation of Camera Re-localization Using Relative Pose Regression

Regression-Based Camera Pose Estimation through Multi-Level Local Features and Global Features

Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

Learning to Localize in Unseen Scenes with Relative Pose Regressors

Improving the generalization of network based relative pose regression: dimension reduction as a regularizer

Relative Pose Estimation of Visual SLAM Based on Convolutional Neural Networks

Understanding the Limitations of CNN-based Absolute Camera Pose Regression

SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

End-to-end Monocular Pose Estimation for Uncooperative Spacecraft Based on Direct Regression Network

Towards Better Generalization: Joint Depth-Pose Learning without PoseNet

Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference

Global Localization: Utilizing Relative Spatio-Temporal Geometric Constraints from Adjacent and Distant Cameras

Handbook on Leveraging Lines for Two-View Relative Pose Estimation

Learning single and multi-scene camera pose regression with transformer encoders

Linear Relative Pose Estimation Founded on Pose-only Imaging Geometry

Homography-Based Loss Function for Camera Pose Regression

A generalizable approach for multi-view 3D human pose regression

Relative geometry-aware siamese neural network for 6DOF camera relocalization

Relative Pose Estimation for RGB-D Human Input Scans Via Implicit Function Reconstruction