Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.

NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields

LabelFusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes

One-Shot Neural Fields for 3D Object Understanding

Semi-Automatic Labeling for Deep Learning in Robotics

PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes

Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation

LaTeRF: Label and Text Driven Object Radiance Fields

LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories

EasyLabel: A Semi-Automatic Pixel-wise Object Annotation Tool for Creating Robotic RGB-D Datasets

SAID-NeRF: Segmentation-AIDed NeRF for Depth Completion of Transparent Objects

NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Rapid Pose Label Generation through Sparse Representation of Unknown Objects

Labeling 3D scenes for Personal Assistant Robots

NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views

Efficient 3D Instance Mapping and Localization with Neural Fields

A 3D-Deep-Learning-based Augmented Reality Calibration Method for Robotic Environments using Depth Sensor Data

Three-Dimensional Object Segmentation Method based on YOLO, SAM, and NeRF

DiscoNeRF: Class-Agnostic Object Field for 3D Object Discovery

NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising