Abstract: Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
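The masking step described in the abstract (hiding random patches of the radiance and density grid before reconstruction) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `mask_random_patches`, the patch size of 4, the 75% mask ratio, the 16³ grid, and the choice to zero out masked voxels are all hypothetical and are not taken from the abstract.

```python
import numpy as np


def mask_random_patches(grid, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping cubic patches.

    grid: array of shape (C, X, Y, Z); here C = 4 for (R, G, B, density).
    Returns the masked grid and a boolean array of shape
    (X // patch, Y // patch, Z // patch) that is True for masked patches.
    All parameter defaults are illustrative assumptions.
    """
    _, x, y, z = grid.shape
    if x % patch or y % patch or z % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")

    nx, ny, nz = x // patch, y // patch, z // patch
    n_patches = nx * ny * nz
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    patch_mask = flat.reshape(nx, ny, nz)

    # Expand the patch-level mask to voxel resolution.
    voxel_mask = (
        patch_mask.repeat(patch, axis=0)
        .repeat(patch, axis=1)
        .repeat(patch, axis=2)
    )

    # Hide the selected voxels in every channel.
    masked = grid.copy()
    masked[:, voxel_mask] = 0
    return masked, patch_mask


# Example: a random stand-in for a 4-channel (RGB + density) 16^3 grid.
grid = np.random.default_rng(1).random((4, 16, 16, 16)).astype(np.float32)
masked, patch_mask = mask_random_patches(grid, patch=4, mask_ratio=0.75)
```

In a masked-autoencoder setup, an encoder would see only the visible patches (or the masked grid) and a decoder would be trained to predict the hidden ones; the sketch covers only the masking itself.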

What problem does this paper attempt to address?

The main goal of this paper is to propose a new self-supervised pre-training method, NeRF-MAE (Neural Radiance Fields Masked AutoEncoders), to address the following issues:

1. **Large-scale self-supervised pre-training using Masked AutoEncoders (MAE)**: The paper attempts to use Masked AutoEncoders for self-supervised pre-training of NeRF's (Neural Radiance Fields) radiance and density grids to generate effective 3D representations.
2. **Improving performance on 3D downstream tasks**: The representations obtained through this pre-training method can significantly improve performance in downstream tasks such as 3D object detection, super-resolution reconstruction, and voxel labeling.

Specifically, the paper addresses the following key issues:

- How to apply Masked AutoEncoders to NeRF's radiance and density grids to achieve efficient 3D scene representation learning;
- How to use large-scale unlabeled image data for pre-training to improve the learning effect of subsequent specific tasks;
- Whether the proposed method can significantly improve performance on multiple challenging 3D tasks compared to existing self-supervised pre-training methods.

The main contributions of the paper include:

- Proposing the first fully self-supervised and transformer-based 3D pre-training method, which directly uses NeRF's radiance and density grids as input modalities and adopts a transparency-aware masking reconstruction objective.
- Constructing a large-scale pre-training dataset containing over 1.8 million images and more than 3,600 indoor scenes for NeRF pre-training.
- Experimental results show that the proposed NeRF-MAE method significantly outperforms existing self-supervised pre-training baselines and other NeRF scene understanding methods on multiple downstream 3D tasks, improving AP50 by 21.5% and AP25 by 8% on the 3D object detection task, and improving mAcc by 12% on the semantic voxel labeling task, while requiring only half the data needed by the current best methods.
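The summary mentions a transparency-aware masked reconstruction objective but does not spell out its form. The sketch below shows one possible interpretation, not the paper's formulation: density error is measured on all masked voxels, while color error is counted only on masked voxels that the target marks as occupied, since color is not meaningful in empty space. The function name `masked_reconstruction_loss`, the `density_threshold` parameter, and the equal weighting of the two terms are assumptions made for illustration.

```python
import numpy as np


def masked_reconstruction_loss(pred, target, voxel_mask, density_threshold=0.0):
    """Illustrative loss over masked voxels only (an assumed formulation).

    pred, target: arrays of shape (4, X, Y, Z), channels (R, G, B, density).
    voxel_mask:   boolean array of shape (X, Y, Z), True where the input
                  was hidden from the encoder.
    """
    # Voxels the target considers non-empty (assumed occupancy test).
    occupied = target[3] > density_threshold

    # Color counts only where a voxel is both masked and occupied;
    # density counts on every masked voxel.
    color_w = (voxel_mask & occupied).astype(pred.dtype)
    density_w = voxel_mask.astype(pred.dtype)

    color_err = ((pred[:3] - target[:3]) ** 2).sum(axis=0) * color_w
    density_err = (pred[3] - target[3]) ** 2 * density_w

    # Normalize by the number of contributing voxels; avoid division by zero.
    color_loss = color_err.sum() / max(color_w.sum(), 1.0)
    density_loss = density_err.sum() / max(density_w.sum(), 1.0)
    return float(color_loss + density_loss)
```

With this weighting, a color mismatch in an empty or unmasked voxel contributes nothing to the loss, so the model is not penalized for colors it cannot observe.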

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

MPS-NeRF: Generalizable 3D Human Rendering from Multiview Images

Mask-Based Modeling for Neural Radiance Fields

Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations

Pre-NeRF 360: Enriching Unbounded Appearances for Neural Radiance Fields

Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder

Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields

MultiPlaneNeRF: Neural Radiance Field with Non-Trainable Representation

NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images

DA4NeRF: Depth-aware augmentation technique for neural radiance fields

AutoNeRF: Training Implicit Scene Representations with Autonomous Agents

AE-NeRF: Auto-Encoding Neural Radiance Fields for 3D-Aware Object Manipulation

NeRF-In: Free-Form Inpainting for Pretrained NeRF With RGB-D Priors

DaRF: Boosting Radiance Fields from Sparse Inputs with Monocular Depth Adaptation

Drone-NeRF: Efficient NeRF based 3D scene reconstruction for large-scale drone survey

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields

NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes