Abstract: Current perception models in autonomous driving rely heavily on large-scale labelled 3D data, which is both costly and time-consuming to annotate. This work proposes to reduce the dependence on labelled 3D training data by pre-training on large-scale unlabelled outdoor LiDAR point clouds using masked autoencoders (MAE). While existing masked point autoencoding methods mainly focus on small-scale indoor point clouds or pillar-based large-scale outdoor LiDAR data, our approach introduces a new self-supervised masked occupancy pre-training method called Occupancy-MAE, specifically designed for voxel-based large-scale outdoor LiDAR point clouds. Occupancy-MAE takes advantage of the gradually sparser voxel occupancy structure of outdoor LiDAR point clouds and incorporates a range-aware random masking strategy and a pretext task of occupancy prediction. By randomly masking voxels based on their distance to the LiDAR and predicting the masked occupancy structure of the entire surrounding 3D scene, Occupancy-MAE encourages the extraction of high-level semantic information to reconstruct the masked voxels using only a small number of visible voxels. Extensive experiments demonstrate the effectiveness of Occupancy-MAE across several downstream tasks. For 3D object detection, Occupancy-MAE halves the labelled data required for car detection on the KITTI dataset and improves small object detection by approximately 2% in AP on the Waymo dataset. For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU. For multi-object tracking, Occupancy-MAE improves over training from scratch by approximately 1% in terms of AMOTA and AMOTP. Code is publicly available at https://github.com/chaytonmin/Occupancy-MAE.
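The range-aware masking and occupancy pretext task described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the distance bins, and the mask ratios below are hypothetical values chosen for illustration, and the only idea taken from the abstract is that voxels are masked at random according to their distance to the LiDAR and that the network is then asked to predict which voxels were occupied.

```python
import math
import random

# Hypothetical (lower, upper, mask_ratio) range bins in metres.
# Illustrative values only; they are assumptions, not taken from the paper.
RANGE_BINS = [(0.0, 30.0, 0.9), (30.0, 50.0, 0.7), (50.0, math.inf, 0.5)]


def range_aware_mask(voxel_centers, bins=RANGE_BINS, seed=0):
    """Split occupied voxels into visible and masked index lists.

    voxel_centers: list of (x, y, z) centres of occupied voxels, with the
    LiDAR sensor assumed to sit at the origin. Each distance bin is masked
    at random with its own ratio, so dense near-range voxels can be masked
    more aggressively than sparse far-range ones.
    """
    rng = random.Random(seed)
    visible, masked = [], []
    for lower, upper, ratio in bins:
        # Indices of voxels whose distance to the sensor falls in this bin.
        idx = [
            i for i, (x, y, z) in enumerate(voxel_centers)
            if lower <= math.sqrt(x * x + y * y + z * z) < upper
        ]
        rng.shuffle(idx)
        n_mask = int(round(ratio * len(idx)))
        masked.extend(idx[:n_mask])
        visible.extend(idx[n_mask:])
    return sorted(visible), sorted(masked)


def occupancy_targets(grid_shape, occupied_coords):
    """Binary occupancy target: 1 for each originally occupied voxel, else 0."""
    nx, ny, nz = grid_shape
    target = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for i, j, k in occupied_coords:
        target[i][j][k] = 1
    return target
```

In such a setup, only the visible voxels would be passed to the encoder, and the decoder output would be compared against the occupancy target of the full scene, for example with a binary classification loss.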

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations

Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders

PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

SP-Det: Leveraging Saliency Prediction for Voxel-Based 3D Object Detection in Sparse Point Cloud

Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation

Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

OPUS: Occupancy Prediction Using a Sparse Set

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

AdaptiveOcc: Adaptive Octree-based Network for Multi-Camera 3D Semantic Occupancy Prediction in Autonomous Driving

3Dopformer: 3D Occupancy Perception from Multi-Camera Images with Directional and Distance Enhancement

Learning Occupancy for Monocular 3D Object Detection

Fully Sparse 3D Occupancy Prediction

Receding Moving Object Segmentation in 3D LiDAR Data Using Sparse 4D Convolutions

UnO: Unsupervised Occupancy Fields for Perception and Forecasting

ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection