Leveraging Image Matching Toward End-to-End Relative Camera Pose Regression

Fadi Khatib, Yuval Margalit, Meirav Galun, Ronen Basri
2024-04-16
Abstract: This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our method predicts the relative rotation and translation (including direction and scale) between the two respective cameras. Inspired by the classical pipeline, our method leverages Image Matching (IM) as a pre-trained task for relative pose regression. Specifically, we use LoFTR, an architecture that utilizes an attention-based network pre-trained on ScanNet, to extract semi-dense feature maps, which are then warped and fed into a pose regression network. Notably, we use a loss function that utilizes separate terms to account for the translation direction and scale. We believe such a separation is important because translation direction is determined by point correspondences, while the scale is inferred from priors on shape sizes. Our ablations further support this choice. We evaluate our method on several datasets and show that it outperforms previous end-to-end methods. The method also generalizes well to unseen datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses relative camera pose estimation: given two images of the same scene taken from different viewpoints, predict the relative rotation and translation (both direction and scale) between the two cameras. It proposes an end-to-end deep learning method that uses image matching as a pre-training task to improve relative pose regression. The main contributions are as follows:
1. An end-to-end relative pose regression framework is proposed that uses image matching (IM) as a pre-training task.
2. A loss function is introduced that separates the direction and scale of the relative translation vector, training the direction with cosine similarity and the scale with an L1 loss.
3. Hard matching and warping are used instead of soft matching and warping, and the advantages of this choice are demonstrated.
4. The feature representations produced by the pre-trained IM backbone are shown to be effective, and the role of interleaved self-attention and cross-attention modules in capturing feature similarity between image pairs is emphasized.
5. The method is evaluated on multiple datasets, including settings where the training and test datasets differ, and it outperforms other end-to-end relative pose regression networks in almost all experiments.
6. The method significantly narrows the performance gap between relative pose regression and feature-matching methods while retaining faster inference.
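The separation of translation direction and scale described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `translation_loss`, the `scale_weight` parameter, the use of `1 - cosine similarity` as the direction term, and the choice to measure scale as the absolute difference of vector norms are all assumptions made for illustration. The rotation term is omitted because the summary does not specify it.

```python
import math


def translation_loss(t_pred, t_gt, scale_weight=1.0, eps=1e-8):
    """Hypothetical translation loss with separate direction and scale terms.

    t_pred, t_gt: sequences of 3 floats (predicted / ground-truth translation).
    scale_weight: assumed weighting between the two terms (not from the paper).
    Returns (total, direction_term, scale_term).
    """
    norm_pred = math.sqrt(sum(x * x for x in t_pred))
    norm_gt = math.sqrt(sum(x * x for x in t_gt))

    # Direction term: 1 - cosine similarity, which is 0 when the vectors
    # point the same way regardless of their lengths.
    dot = sum(p * g for p, g in zip(t_pred, t_gt))
    cos_sim = dot / max(norm_pred * norm_gt, eps)
    direction_term = 1.0 - cos_sim

    # Scale term: L1 distance between the two vector lengths, which is 0
    # when the magnitudes agree regardless of direction.
    scale_term = abs(norm_pred - norm_gt)

    total = direction_term + scale_weight * scale_term
    return total, direction_term, scale_term


# Same direction, wrong length: only the scale term is non-zero.
same_dir = translation_loss((2.0, 0.0, 0.0), (1.0, 0.0, 0.0))

# Same length, orthogonal direction: only the direction term is non-zero.
same_len = translation_loss((0.0, 1.0, 0.0), (1.0, 0.0, 0.0))
```

In this sketch, an error in length does not affect the direction term and an error in direction does not affect the scale term, which matches the motivation given in the abstract: direction is determined by point correspondences, while scale is inferred from priors on shape sizes.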