Abstract: Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud. It plays an important role in camera localization within pre-built LiDAR maps. Despite the modality gaps, most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization, resulting in poor accuracy and interpretability. In this paper, we propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose based on each intermediate state. To achieve this, we employ reinforcement learning to develop a cross-modal registration agent (CMR-Agent), and use imitation learning to initialize its registration policy for stability and quick-start of the training. According to the cross-modal observations, we propose a 2D-3D hybrid state representation that fully exploits the fine-grained features of RGB images while reducing the useless neutral states caused by the spatial truncation of camera frustum. Additionally, the overall framework is well-designed to efficiently reuse one-shot cross-modal embeddings, avoiding repetitive and time-consuming feature extraction. Extensive experiments on the KITTI-Odometry and NuScenes datasets demonstrate that CMR-Agent achieves competitive accuracy and efficiency in registration. Once the one-shot embeddings are completed, each iteration only takes a few milliseconds.

What problem does this paper attempt to address?

The paper addresses image-to-point cloud registration: determining the relative camera pose of an RGB image with respect to a point cloud. This task is central to camera localization within pre-built LiDAR maps. The authors point out that existing learning-based methods establish 2D-3D point correspondences in feature space but lack a feedback mechanism for iterative optimization, which results in low accuracy and poor interpretability. To solve this problem, the paper reformulates the registration process as a Markov decision process and uses reinforcement learning to develop a cross-modal registration agent (CMR-Agent) that iteratively adjusts the camera pose based on the current state. The main contributions of the paper are as follows:

1. **A new method for cross-modal registration**: The paper reformulates image-to-point cloud registration as a Markov decision process and trains a fully cross-modal registration agent by combining reinforcement learning and imitation learning.
2. **2D-3D hybrid state representation**: To fully exploit the fine-grained features of the image and reduce useless neutral states (ineffective states caused by the spatial truncation of the camera frustum), the paper proposes a 2D-3D hybrid state representation.
3. **Point-to-point alignment reward**: To guide the training of the agent, the paper designs a point-to-point alignment reward that measures the alignment between the 3D points of the point cloud and the 3D points obtained by back-projecting 2D pixels.
4. **Efficient framework design**: To avoid repetitive and time-consuming feature extraction, the framework lets the agent reuse one-shot cross-modal embeddings, which reduces the cost of each iteration.
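To make the idea of a point-to-point alignment reward more concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes the target 3D points (standing in for the points obtained by back-projecting 2D pixels) are already available, and it rewards a pose update by how much it reduces the mean point-to-point distance. The function names `transform`, `mean_alignment_error`, and `alignment_reward` are hypothetical.

```python
import numpy as np


def transform(points, T):
    """Apply a 4x4 rigid transform T to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]


def mean_alignment_error(points, T_est, target_points):
    """Mean Euclidean distance between the cloud under T_est and its targets.

    Assumption: target_points[i] is the 3D point that points[i] should
    coincide with once the estimated pose is correct.
    """
    diff = transform(points, T_est) - target_points
    return float(np.linalg.norm(diff, axis=1).mean())


def alignment_reward(points, T_prev, T_next, target_points):
    """Positive when the step from T_prev to T_next improves the alignment."""
    before = mean_alignment_error(points, T_prev, target_points)
    after = mean_alignment_error(points, T_next, target_points)
    return before - after
```

Under this sketch, a step that moves the cloud closer to its targets earns a positive reward, a step that moves it away earns a negative one, and a step that changes nothing earns zero.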
Experimental results show that the CMR-Agent outperforms existing techniques in registration accuracy and efficiency on the KITTI-Odometry and NuScenes datasets. Even as an iterative method, it takes less than 68 milliseconds to perform 10 iterations, demonstrating high efficiency while maintaining high accuracy.
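The low per-iteration cost follows from the loop structure: the expensive cross-modal embeddings are computed once, and each iteration only queries the policy and updates the pose. The sketch below illustrates that structure under stated assumptions. `embed_once`, `register`, and `make_toy_policy` are hypothetical names, and the toy policy is handed the target pose purely so the example runs; a learned agent would instead predict each step from the current state.

```python
import numpy as np


def embed_once(image, points):
    """Hypothetical stand-in for the expensive one-shot feature extraction."""
    return {"image": image.astype(np.float32), "points": points.astype(np.float32)}


def register(image, points, policy, init_pose, num_iters=10):
    """Iteratively refine a 4x4 camera pose, reusing the same embeddings."""
    embeddings = embed_once(image, points)  # computed once, outside the loop
    pose = init_pose.copy()
    for _ in range(num_iters):
        action = policy(embeddings, pose)   # 4x4 incremental transform
        pose = action @ pose                # apply the incremental adjustment
    return pose


def make_toy_policy(target_pose, step=0.5):
    """Toy policy (assumption): translate a fixed fraction toward a known target."""
    def policy(embeddings, pose):
        action = np.eye(4)
        action[:3, 3] = step * (target_pose[:3, 3] - pose[:3, 3])
        return action
    return policy
```

With `step=0.5`, the remaining translation error halves on every iteration, so ten iterations shrink it by a factor of about a thousand. The rotation is left untouched in this toy to keep the example short.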

CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration
