CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration

Gongxin Yao, Yixin Xuan, Xinyang Li, Yu Pan
2024-08-05
Abstract: Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud. It plays an important role in camera localization within pre-built LiDAR maps. Despite the modality gaps, most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization, resulting in poor accuracy and interpretability. In this paper, we propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose based on each intermediate state. To achieve this, we employ reinforcement learning to develop a cross-modal registration agent (CMR-Agent), and use imitation learning to initialize its registration policy for stability and quick-start of the training. According to the cross-modal observations, we propose a 2D-3D hybrid state representation that fully exploits the fine-grained features of RGB images while reducing the useless neutral states caused by the spatial truncation of camera frustum. Additionally, the overall framework is well-designed to efficiently reuse one-shot cross-modal embeddings, avoiding repetitive and time-consuming feature extraction. Extensive experiments on the KITTI-Odometry and NuScenes datasets demonstrate that CMR-Agent achieves competitive accuracy and efficiency in registration. Once the one-shot embeddings are completed, each iteration only takes a few milliseconds.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses image-to-point cloud registration: determining the relative camera pose of an RGB image with respect to a point cloud, which is central to camera localization in pre-built LiDAR maps. The authors point out that existing learning-based methods establish 2D-3D point correspondences in feature space but lack a feedback mechanism for iterative optimization, resulting in low accuracy and poor interpretability. To solve this, the paper reformulates registration as a Markov decision process and uses reinforcement learning to develop a cross-modal registration agent (CMR-Agent) that iteratively adjusts the camera pose based on the current state. The main contributions are:

1. **A new method for cross-modal registration**: The paper reformulates image-to-point cloud registration as a Markov decision process and trains a fully cross-modal registration agent by combining reinforcement learning with imitation learning, where imitation learning initializes the policy for stable, fast-starting training.
2. **2D-3D hybrid state representation**: To fully exploit the fine-grained features of the image while reducing useless neutral states (states that carry no information because the camera frustum spatially truncates the point cloud), the paper proposes a 2D-3D hybrid state representation.
3. **Point-to-point alignment reward**: To guide the training of the agent, the paper designs a point-to-point alignment reward that measures how well the 3D points of the point cloud align with the 3D points obtained by back-projecting 2D pixels.
4. **Efficient framework design**: To avoid repetitive and time-consuming feature extraction, the framework lets the agent reuse one-shot cross-modal embeddings, which reduces the cost of each iteration.

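
The point-to-point alignment reward in contribution 3 can be illustrated with a minimal sketch. This is one plausible reading, not the paper's exact formulation: it assumes the back-projected reference points are available as the point cloud under a reference pose `T_ref`, and it rewards a step by how much it reduces the mean point-to-point distance. The function names, the 4x4 pose convention, and the "distance reduction" form of the reward are all assumptions made for illustration.

```python
import numpy as np

def transform(T, pts):
    """Apply a 4x4 rigid transform T to an (N, 3) array of points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def mean_point_distance(T_est, T_ref, pts):
    """Mean distance between each point under T_est and the same point under T_ref."""
    diff = transform(T_est, pts) - transform(T_ref, pts)
    return float(np.linalg.norm(diff, axis=1).mean())

def alignment_reward(T_prev, T_next, T_ref, pts):
    """Assumed reward: positive when the step moves the points closer to the reference."""
    return mean_point_distance(T_prev, T_ref, pts) - mean_point_distance(T_next, T_ref, pts)

def translation(x, y, z):
    """Helper (hypothetical): pure-translation pose as a 4x4 matrix."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
# Moving from a 1.0 m offset to a 0.5 m offset along x reduces the error by 0.5.
reward = alignment_reward(translation(1.0, 0, 0), translation(0.5, 0, 0), np.eye(4), points)
```

Because every point is compared with its own counterpart, the reward needs no nearest-neighbour search, and a step that worsens the alignment receives a negative value.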
Experimental results on the KITTI-Odometry and NuScenes datasets show that CMR-Agent achieves competitive registration accuracy and efficiency. Although the method is iterative, 10 iterations take less than 68 milliseconds, because the cross-modal embeddings are computed only once and each subsequent iteration costs only a few milliseconds.
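
The low per-iteration cost follows from the control flow: features are extracted once, and only a lightweight policy runs inside the loop. The toy sketch below shows that structure only. `ToyAgent`, `encode_once`, `act`, the step sizes in `STEPS`, and the additive 6-vector pose are all hypothetical stand-ins; in particular, `act` is an oracle that reads the target pose (loosely like an imitation-learning expert), whereas a learned policy would infer its action from the embeddings and the current state.

```python
import numpy as np

# Hypothetical discrete step sizes available on every pose axis.
STEPS = np.array([-0.27, -0.09, -0.03, 0.0, 0.03, 0.09, 0.27])

class ToyAgent:
    def __init__(self):
        self.encoder_calls = 0

    def encode_once(self, image, cloud):
        """Stand-in for the expensive one-shot cross-modal feature extraction."""
        self.encoder_calls += 1
        return {"image": image.mean(), "cloud": cloud.mean(axis=0)}

    def act(self, embeddings, pose, target_pose):
        """Oracle stand-in for the policy: per axis, pick the step closest to the residual.

        The embeddings are unused here; a real policy would condition on them.
        """
        residual = target_pose - pose
        idx = np.abs(residual[:, None] - STEPS[None, :]).argmin(axis=1)
        return STEPS[idx]

def register(agent, image, cloud, init_pose, target_pose, n_iters=10):
    """Refine a 6-vector pose (tx, ty, tz, roll, pitch, yaw) step by step."""
    embeddings = agent.encode_once(image, cloud)  # computed once, reused below
    pose = init_pose.copy()
    for _ in range(n_iters):
        # Simplification: update pose parameters additively instead of composing transforms.
        pose = pose + agent.act(embeddings, pose, target_pose)
    return pose

agent = ToyAgent()
image = np.zeros((4, 4, 3))
cloud = np.zeros((10, 3))
target = np.array([0.5, -0.3, 0.1, 0.05, -0.02, 0.0])
final_pose = register(agent, image, cloud, np.zeros(6), target)
```

After 10 iterations the toy pose is within the smallest step size of the target on every axis, and the encoder has been called exactly once.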