Abstract: This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in Neural Radiance Field (NeRF), namely NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on certain view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., Lidar point clouds and images, are rendered into projected multi-modal feature maps via neural rendering. Then, original multi-modal signals serve as reconstruction targets for the rendered multi-modal feature maps to enable self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation) with diverse amounts of fine-tuning labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of both the mechanism of masked autoencoder and neural radiance field. We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.

What problem does this paper attempt to address?

The problem this paper addresses is: **How can a unified self-supervised pre-training framework be designed to learn transferable multi-modal perception representations, and thereby improve multi-modal and uni-modal perception models on 3D perception tasks?** Specifically, the paper proposes a framework named **NeRF-Supervised Masked AutoEncoder (NS-MAE)**, which targets two problems:

1. **Scarce data annotation**: Current multi-modal perception models usually rely on large amounts of annotated data for fully supervised training, but high-quality 3D annotated data (such as paired images and sparse LiDAR point clouds) are very scarce. This makes traditional fully supervised training difficult to scale.
2. **Lack of a unified multi-modal representation learning method**: Most existing self-supervised pre-training methods target uni-modal perception models, their optimization formulations are not unified, and there is no pre-training method designed specifically for multi-modal perception models.

### Main contributions of the paper:

1. **Propose a new unified self-supervised pre-training framework** (NS-MAE) that is suitable for both multi-modal and uni-modal perception models.
2. **Unify self-supervision and optimization in multi-modal representation learning**. Introducing the rendering mechanism of Neural Radiance Field (NeRF) makes the multi-modal reconstruction process more concise and general.
3. **Verify the effectiveness of NS-MAE**. Experiments on several advanced uni-modal and multi-modal perception models show its transferability and performance improvements on different 3D perception tasks (such as 3D object detection and BEV map segmentation).

### Workflow of the framework:

1. **Masking**: The input images and the voxelized LiDAR point clouds are each partially masked.
2. **Rendering**: The embeddings extracted from the masked images and point clouds are rendered, via neural rendering, into color feature maps and projected point cloud feature maps.
3. **Reconstruction**: The rendered results are optimized by multi-modal reconstruction, supervised by the original images and point clouds, so that the embedding network learns representations end-to-end.

In this way, NS-MAE can not only make effective use of unannotated multi-modal data but also improve downstream task performance, especially when annotated data is limited.
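
The masking, rendering, and reconstruction steps can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`random_mask`, `volume_render`, `reconstruction_loss`), the zero-fill masking, the squared-error color term, the absolute-error depth term, and the `depth_weight` parameter are all assumptions made for illustration. The rendering function uses the standard NeRF quadrature along a single ray, and the densities and colors are supplied directly rather than predicted by an embedding network.

```python
import numpy as np


def random_mask(x, mask_ratio, rng):
    """Zero out a random subset of rows of x (rows stand in for image patches or voxels)."""
    n = x.shape[0]
    n_masked = int(round(mask_ratio * n))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    keep = np.ones(n, dtype=bool)
    keep[masked_idx] = False
    return x * keep[:, None], keep


def volume_render(sigma, color, t):
    """NeRF-style quadrature along one ray.

    sigma: (S,) non-negative densities at the samples
    color: (S, 3) per-sample colors
    t:     (S,) increasing sample depths along the ray
    """
    # Spacing between samples; the last interval reuses the previous spacing.
    delta = np.append(t[1:] - t[:-1], t[-1] - t[-2])
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each interval
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    rgb = (weights[:, None] * color).sum(axis=0)  # rendered color
    depth = (weights * t).sum()                   # rendered (expected) depth
    return rgb, depth, weights


def reconstruction_loss(rgb, depth, rgb_target, depth_target, depth_weight=1.0):
    """Assumed loss: squared error on color plus weighted absolute error on depth."""
    color_term = np.mean((rgb - rgb_target) ** 2)
    depth_term = np.abs(depth - depth_target)
    return color_term + depth_weight * depth_term


rng = np.random.default_rng(0)

# Step 1 (masking): hide half of 8 toy "patches" with 16 features each.
patches = rng.random((8, 16))
masked_patches, keep = random_mask(patches, mask_ratio=0.5, rng=rng)

# Step 2 (rendering): one ray with a single nearly opaque sample at depth 2.
t = np.linspace(1.0, 4.0, 4)
sigma = np.array([0.0, 1e3, 0.0, 0.0])
color = rng.random((4, 3))
rgb, depth, weights = volume_render(sigma, color, t)

# Step 3 (reconstruction): compare against the original (unmasked) signals.
loss = reconstruction_loss(rgb, depth, rgb_target=color[1], depth_target=2.0)
print(keep.sum(), rgb, depth, loss)
```

In a real pipeline the densities and colors would come from a network fed with the masked inputs, and the loss would be back-propagated through the renderer; an automatic-differentiation framework would replace plain NumPy for that purpose.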

Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

Multi-modal NeRF Self-Supervision for LiDAR Semantic Segmentation

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Mask-Based Modeling for Neural Radiance Fields

UniM^2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

AutoNeRF: Training Implicit Scene Representations with Autonomous Agents

Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields

A Multi-Modal Unified Representation Learning Framework with Masked Image Modeling for Remote Sensing Images

Inter-Modal Masked Autoencoder for Self-Supervised Learning on Point Clouds

MultiMAE: Multi-modal Multi-task Masked Autoencoders

NeRF-MS: Neural Radiance Fields with Multi-Sequence

NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes

MMNeRF: Multi-Modal and Multi-View Optimized Cross-Scene Neural Radiance Fields

Distributed NeRF Learning for Collaborative Multi-Robot Perception

3DMAE: Joint SAR and Optical Representation Learning with Vertical Masking

NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping

Towards Embodied Neural Radiance Fields

Masked Autoencoders in 3D Point Cloud Representation Learning

Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing