CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao, Xiaoqin Zhang, Guobao Xiao, Min Li, Tao Wang, Mang Ye
2024-06-09
Abstract: Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inlier-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a modicum of true correspondences naturally serves as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, a bi-level designed encoder is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments have shown that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses the **challenges of pre-training for the correspondence pruning task**. Specifically, the authors target two main problems:

1. **High pre-training cost and limited data**:
   - Correspondence pruning usually relies on Graph Neural Networks (GNNs), which are computationally expensive on large-scale data. Conventional pre-training, such as fully supervised learning through classification tasks, is effective but costly, especially for long sequences or a large number of initial correspondences.
   - Growing dataset size further exacerbates this cost, leaving training from scratch on the target dataset as the only viable option. In addition, without extra data, conventional pre-training cannot provide useful prior knowledge for correspondence pruning.
2. **How to effectively reconstruct masked correspondences**:
   - The key to correspondence pruning is accurately identifying true correspondences (inliers) and recovering the two-view geometric relationship. However, because correspondences are unordered and irregular, the standard Masked Autoencoder (MAE) framework cannot be applied directly to reconstructing them.
   - Existing image-based MAE methods ignore the positional information of correspondences, which leads to poor reconstruction.

### Proposed solutions

To address these challenges, the authors propose a framework named **CorrMAE (Correspondence Masked Autoencoder)**, with the following contributions:

1. **A pre-training method for correspondence pruning**:
   - It learns a generic inlier-consistent representation through a reconstruction task: reconstructing masked correspondences yields a strong initial representation that improves downstream performance.
   - It significantly reduces pre-training cost and can improve model performance even without additional data.
2. **The CorrMAE framework**:
   - CorrMAE includes four main stages: correspondence masking, correspondence learning, source/target matching point reconstruction, and supervision.
   - A dual-branch structure with a dedicated positional encoding reconstructs the matching points in the source and target images separately, which indirectly reconstructs the masked correspondences.
   - An alignment loss supervises the reconstructed matching points so that the points in the source and target images remain consistent with each other.
3. **Experimental results**:
   - Extensive experiments show that models pre-trained with CorrMAE outperform existing methods on multiple challenging benchmarks, especially in camera pose estimation and visual localization.

Together, these contributions address the pre-training problem in correspondence pruning and provide a starting point for future research.
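
The masking and dual-branch reconstruction targets described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` tuple layout, the `mask_ratio` value, and the plain squared-error measure (a stand-in for the paper's alignment loss, whose exact form is not given here) are all assumptions made for illustration.

```python
import random


def mask_correspondences(corrs, mask_ratio=0.6, seed=0):
    """Split correspondences into visible and masked subsets.

    corrs: list of (x1, y1, x2, y2) tuples, one per match (assumed layout).
    mask_ratio: fraction to hide; 0.6 is a hypothetical value, not from the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(corrs)))
    rng.shuffle(indices)  # correspondences are unordered, so mask uniformly at random
    num_masked = int(round(len(corrs) * mask_ratio))
    masked_idx = sorted(indices[:num_masked])
    visible_idx = sorted(indices[num_masked:])
    visible = [corrs[i] for i in visible_idx]
    masked = [corrs[i] for i in masked_idx]
    return visible, masked


def split_targets(masked):
    """Separate masked correspondences into source-image and target-image points.

    These are the two sets of points that the two reconstruction branches
    would each be asked to recover.
    """
    source = [(x1, y1) for x1, y1, _, _ in masked]
    target = [(x2, y2) for _, _, x2, y2 in masked]
    return source, target


def mean_squared_error(pred, truth):
    """Average squared distance between predicted and true 2D points.

    Illustrative stand-in only; the paper's alignment loss is not reproduced.
    """
    if not truth:
        return 0.0
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(pred, truth))
    return total / len(truth)


if __name__ == "__main__":
    # Toy data: ten synthetic matches where the target point is the source shifted by 1.
    corrs = [(float(i), float(i), float(i) + 1.0, float(i) + 1.0) for i in range(10)]
    visible, masked = mask_correspondences(corrs)
    source_pts, target_pts = split_targets(masked)
    print(len(visible), len(masked))
    print(mean_squared_error(source_pts, source_pts))
```

In this sketch an encoder would only see `visible`, while `source_pts` and `target_pts` serve as ground truth for the two reconstruction branches; the network itself is deliberately left out.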