Abstract: It is a long-term vision of the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset to obtain unified representations that achieve promising results on different tasks or benchmarks. Previous works mainly focus on the self-supervised pre-training pipeline, meaning that they perform pre-training and fine-tuning on the same benchmark, which makes it difficult for the pre-training checkpoint to attain performance scalability and cross-dataset applicability. In this paper, for the first time, we commit to building a large-scale pre-training point-cloud dataset with a diverse data distribution, while learning generalizable representations from such a diverse pre-training dataset. We formulate the point-cloud pre-training task as a semi-supervised problem, which leverages few-shot labeled and massive unlabeled point-cloud data to generate unified backbone representations that can be directly applied to many baseline models and benchmarks, decoupling the AD-related pre-training process from the downstream fine-tuning task. During backbone pre-training, by enhancing the scene- and instance-level distribution diversity and exploiting the backbone's ability to learn from unknown instances, we achieve significant performance gains on a series of downstream perception benchmarks, including Waymo, nuScenes, and KITTI, under different baseline models such as PV-RCNN++, SECOND, and CenterPoint.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of pre-training autonomous driving (AD) perception models on large-scale point cloud datasets to obtain a unified representation that performs well across different tasks or benchmarks. Specifically, existing research mainly focuses on self-supervised pre-training pipelines, i.e., pre-training and fine-tuning on the same benchmark dataset, which makes it difficult to achieve performance scalability and cross-dataset application. This paper is the first to focus on constructing a large-scale pre-training point cloud dataset with diverse data distributions and learning general representations from such diverse pre-training datasets.

### Main Contributions

1. **Proposing the AD-PT Paradigm**: This is the first time the AD-PT paradigm is proposed, aiming to learn a unified representation by pre-training a general backbone network and transferring the knowledge to various benchmarks.
2. **Diverse Pre-training Data Preparation**: A diverse pre-training data preparation process and unknown instance learning methods are proposed, which can enhance the representational capability of feature extraction during the backbone network pre-training process.
3. **Unified Approach**: The study shows that once the pre-training checkpoints are generated, they can be directly loaded into multiple perception baselines and benchmarks. Experimental results further validate that this AD-PT paradigm significantly improves accuracy on different benchmarks (e.g., Waymo, nuScenes, and KITTI).

### Method Overview

1. **Large-scale Point Cloud Dataset Preparation**:
   - **Category-aware Pseudo Label Generator**: Different baseline models are used to annotate different semantic classes, and semi-supervised methods (e.g., MeanTeacher) are employed to further improve accuracy on the ONCE validation set.
   - **Diversity-based Pre-training Processor**: Scene-level and region-level data diversity is increased through point-to-beam resampling and object rescaling strategies.
2. **Learning Unified Representation**:
   - **Unknown Instance Learning Head**: A two-branch unknown instance learning head is designed to avoid mistaking potential foreground instances for background parts, and consistency loss is used to ensure the consistency of the computed corresponding foreground regions.

### Experimental Results

- The AD-PT paradigm significantly improves the performance of different baseline models on benchmarks such as Waymo, nuScenes, and KITTI.
- Compared to existing self-supervised pre-training and semi-supervised learning methods, AD-PT demonstrates better generalization ability and higher accuracy across various datasets.

### Conclusion

By constructing a large-scale and diverse point cloud dataset and designing effective pre-training methods, this paper successfully addresses the generalization problem of autonomous driving perception models across different tasks and datasets. This approach provides new insights and technical support for future autonomous driving research.
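### Illustrative Sketch: Data Diversity Augmentations

To make the two diversity strategies named in the Method Overview more concrete, the following is a minimal NumPy sketch of beam downsampling and object rescaling. It is not the paper's implementation. The function names `downsample_beams` and `rescale_object`, the elevation-binning approximation of LiDAR beams, the axis-aligned box format `(cx, cy, cz, dx, dy, dz)`, and all parameter defaults are assumptions made for illustration only. A real pipeline would likely handle box headings and keep rescaled objects on the ground plane.

```python
import numpy as np


def downsample_beams(points, num_beams=40, keep_every=2):
    """Simulate a sparser LiDAR by dropping beams (hypothetical helper).

    Beams are approximated by binning each point's elevation angle into
    `num_beams` equal-width bins; only every `keep_every`-th bin is kept.
    `points` is an (N, C) float array whose first three columns are x, y, z.
    """
    if len(points) == 0:
        return points
    xyz = points[:, :3]
    horizontal_range = np.linalg.norm(xyz[:, :2], axis=1)
    elevation = np.arctan2(xyz[:, 2], horizontal_range)
    low, high = elevation.min(), elevation.max()
    span = max(high - low, 1e-9)  # avoid division by zero for flat scans
    # Map each elevation to a beam index in [0, num_beams - 1].
    beam_id = ((elevation - low) / span * num_beams).astype(int)
    beam_id = np.minimum(beam_id, num_beams - 1)
    return points[beam_id % keep_every == 0]


def rescale_object(points, box, scale):
    """Scale the points inside an axis-aligned box about the box center.

    `box` is (cx, cy, cz, dx, dy, dz); heading is ignored in this sketch.
    Returns the modified point array and the rescaled box.
    """
    center = np.asarray(box[:3], dtype=float)
    size = np.asarray(box[3:6], dtype=float)
    inside = np.all(np.abs(points[:, :3] - center) <= size / 2, axis=1)
    out = points.astype(float)  # astype returns a copy, so the input is untouched
    out[inside, :3] = center + (out[inside, :3] - center) * scale
    new_box = np.concatenate([center, size * scale])
    return out, new_box
```

Under these assumptions, calling `downsample_beams(scan, num_beams=40, keep_every=2)` would roughly halve the vertical resolution of a scan, and `rescale_object(scan, box, 1.1)` would enlarge one labeled object by 10%, which is the kind of scene-level and object-level variation the summary describes.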

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset
