NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus
2024-07-19
Abstract: Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main goal of this paper is to propose a new self-supervised pre-training method, NeRF-MAE (Neural Radiance Fields Masked AutoEncoders), to address the following issues:

1. **Large-scale self-supervised pre-training using Masked AutoEncoders (MAE)**: The paper applies masked autoencoders to the radiance and density grids of NeRFs (Neural Radiance Fields) for self-supervised pre-training, with the aim of producing effective 3D representations.
2. **Improving performance on 3D downstream tasks**: The representations obtained through this pre-training can significantly improve performance on downstream tasks such as 3D object detection, voxel super-resolution, and semantic voxel labeling.

Specifically, the paper addresses the following key questions:

- How to apply masked autoencoders to NeRF's radiance and density grids to achieve efficient 3D scene representation learning;
- How to use large-scale unlabeled image data for pre-training so that subsequent task-specific learning improves;
- Whether the proposed method significantly improves performance on multiple challenging 3D tasks compared to existing self-supervised pre-training methods.

The main contributions of the paper include:

- Proposing the first fully self-supervised, transformer-based 3D pre-training method that directly uses NeRF's radiance and density grids as the input modality and adopts an opacity-aware masked reconstruction objective.
- Constructing a large-scale dataset for NeRF pre-training, containing over 1.8 million images and more than 3,600 indoor scenes.
- Showing experimentally that NeRF-MAE significantly outperforms existing self-supervised pre-training baselines and other NeRF scene understanding methods on multiple downstream 3D tasks: an absolute improvement of 21.5% AP50 and 8% AP25 on 3D object detection, and 12% mAcc on semantic voxel labeling, while requiring only half the data needed by the current best methods.
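
The core idea of masking random patches of a radiance-and-density grid and scoring the reconstruction only on the masked region can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `random_patch_mask` and `masked_mse`, the patch size, the 75% mask ratio, the 4-channel (RGB plus density) layout, and the use of plain mean squared error instead of the paper's objective are all assumptions made for illustration, and no transformer is involved.

```python
import numpy as np


def random_patch_mask(grid, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of cubic patches in a voxel grid.

    grid: array of shape (C, X, Y, Z); X, Y, Z must be divisible by `patch`.
    Returns (masked_grid, voxel_mask), where voxel_mask has shape (X, Y, Z)
    and is True for voxels that belong to a masked patch.
    All parameter values here are illustrative assumptions.
    """
    _, x, y, z = grid.shape
    assert x % patch == 0 and y % patch == 0 and z % patch == 0
    px, py, pz = x // patch, y // patch, z // patch
    n_patches = px * py * pz
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    patch_mask = flat.reshape(px, py, pz)

    # Expand the patch-level mask to voxel resolution.
    voxel_mask = patch_mask
    for axis in range(3):
        voxel_mask = np.repeat(voxel_mask, patch, axis=axis)

    # Broadcast the (X, Y, Z) mask over the channel dimension.
    masked_grid = np.where(voxel_mask[None], 0.0, grid)
    return masked_grid, voxel_mask


def masked_mse(pred, target, voxel_mask):
    """Mean squared error computed only over masked voxels."""
    sq_err = (pred - target) ** 2
    selected = np.broadcast_to(voxel_mask[None], sq_err.shape)
    return float(sq_err[selected].mean())


# Hypothetical 4-channel grid: RGB radiance plus density.
grid = np.random.default_rng(1).random((4, 16, 16, 16))
masked, mask = random_patch_mask(grid, patch=4, mask_ratio=0.75)
# A trivial "reconstruction" that just returns the masked input.
print("fraction of voxels masked:", mask.mean())
print("loss of the trivial reconstruction:", masked_mse(masked, grid, mask))
```

In an actual pre-training loop, the masked grid would be passed through an encoder-decoder network and the loss would drive the network to fill in the hidden patches; the sketch only shows the masking and the masked-region loss.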