Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder

Xiaohao Xu
2023-12-06
Abstract: This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in Neural Radiance Field (NeRF), namely NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on certain view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., Lidar point clouds and images, are rendered into projected multi-modal feature maps via neural rendering. Then, original multi-modal signals serve as reconstruction targets for the rendered multi-modal feature maps to enable self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation) with diverse amounts of fine-tuning labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of both the mechanism of masked autoencoder and neural radiance field. We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Robotics
What problem does this paper attempt to address?
The paper asks: **How can a unified self-supervised pre-training framework learn transferable multi-modal perception representations that improve both multi-modal and uni-modal perception models on 3D perception tasks?** To answer this, it proposes **NeRF-Supervised Masked AutoEncoder (NS-MAE)**, which targets two problems:

1. **Scarce annotated data**: Current multi-modal perception models usually rely on large amounts of annotated data for fully supervised training, but high-quality 3D annotations (for example, on paired images and sparse LiDAR point clouds) are very scarce. This makes fully supervised training difficult to scale.
2. **No unified multi-modal representation learning method**: Most existing self-supervised pre-training methods target uni-modal perception models, their optimization formulations are not unified, and no pre-training method is designed specifically for multi-modal perception models.

### Main contributions of the paper:

1. **A new unified self-supervised pre-training framework** (NS-MAE) that applies to both multi-modal and uni-modal perception models.
2. **Unified self-supervision and optimization for multi-modal representation learning**. The rendering mechanism of Neural Radiance Field (NeRF) makes the multi-modal reconstruction process more concise and general.
3. **Empirical validation of NS-MAE**. Experiments on several advanced uni-modal and multi-modal perception models show its transferability and performance gains on different 3D perception tasks (3D object detection and BEV map segmentation).

### Workflow of the framework:

1. **Masking**: The input images and the voxelized LiDAR point clouds are each partially masked.
2. **Rendering**: The embeddings extracted from the masked images and point clouds are rendered, via neural rendering, into color feature maps and projected point cloud feature maps.
3. **Reconstruction**: The rendered maps are optimized with a multi-modal reconstruction objective supervised by the original images and point clouds, so the embedding network learns representations end-to-end.

In this way, NS-MAE makes effective use of unannotated multi-modal data and improves downstream task performance, especially when annotated data is limited.
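
The three workflow steps (masking, rendering, reconstruction) can be sketched in plain Python for a single camera ray. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `random_mask`, `render_ray`, and `reconstruction_loss` are hypothetical, rendered depth is used here as a stand-in for the projected point cloud target, and the squared-error and absolute-error loss terms are assumed rather than taken from the paper. The rendering step follows the standard NeRF volume rendering quadrature.

```python
import math
import random


def random_mask(num_items, mask_ratio, seed=0):
    """Step 1 (masking): pick which image patches or LiDAR voxels to hide.

    Returns a list of booleans; True marks a masked item.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_items * mask_ratio))
    masked = set(rng.sample(range(num_items), num_masked))
    return [i in masked for i in range(num_items)]


def render_ray(densities, colors, depths):
    """Step 2 (rendering): standard NeRF-style volume rendering along one ray.

    densities: non-negative density (sigma) per sample
    colors:    (r, g, b) tuple per sample
    depths:    increasing sample depths along the ray (at least two samples)
    Returns (rendered_color, rendered_depth, weights).
    """
    transmittance = 1.0
    weights = []
    for i, sigma in enumerate(densities):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < len(depths):
            delta = depths[i + 1] - depths[i]
        else:
            delta = depths[i] - depths[i - 1]
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    color = tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))
    depth = sum(w * d for w, d in zip(weights, depths))
    return color, depth, weights


def reconstruction_loss(pred_color, target_color, pred_depth, target_depth,
                        depth_weight=1.0):
    """Step 3 (reconstruction): compare rendered outputs with the original signals.

    Assumed form: mean squared error on color plus weighted absolute error on depth.
    """
    color_term = sum((p - t) ** 2 for p, t in zip(pred_color, target_color)) / 3.0
    depth_term = abs(pred_depth - target_depth)
    return color_term + depth_weight * depth_term


# Usage: hide half of 10 patches, then render a ray that hits an opaque surface
# at depth 2.0. In a real model the densities and colors would be predicted from
# embeddings of the masked inputs; here they are hand-set for illustration.
mask = random_mask(num_items=10, mask_ratio=0.5)
color, depth, weights = render_ray(
    densities=[0.0, 1e3, 0.0],
    colors=[(0.0, 0.0, 0.0), (0.2, 0.4, 0.6), (1.0, 1.0, 1.0)],
    depths=[1.0, 2.0, 3.0],
)
loss = reconstruction_loss(color, (0.2, 0.4, 0.6), depth, 2.0)
```

In the usage example the opaque middle sample absorbs all of the ray's weight, so the rendered color and depth match that sample and the loss against the matching target is zero. In pre-training, gradients of such a loss would flow back through the rendering step into the embedding network, which is what lets the original images and point clouds supervise the masked inputs.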