Abstract: 3D pre-training is crucial to 3D perception tasks. However, limited by the difficulty of collecting clean 3D data, 3D pre-training has consistently faced data scaling challenges. Inspired by semi-supervised learning, which leverages limited labeled data and a large amount of unlabeled data, we propose a novel self-supervised pre-training framework that uses real 3D data together with pseudo-3D data lifted from images by a large depth estimation model. Another challenge lies in efficiency. Previous methods such as Point-BERT and Point-MAE employ k-nearest neighbors to embed 3D tokens, which requires quadratic time complexity. To pre-train efficiently on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.

What problem does this paper attempt to address?

This paper attempts to address the issues of data scale and efficiency in 3D pre-training. Specifically:

1. **Data Scale Challenges**:
   - 3D perception tasks (such as robotics and augmented reality) rely on a large amount of clean 3D data for effective pre-training, but collecting such data is very difficult and expensive.
   - Existing 3D pre-training methods usually use complete 3D objects or 3D scenes reconstructed from RGB-D scans. These methods are not only costly but also prone to introducing noise and artifacts, resulting in insufficient data volume and diversity.
2. **Computational Efficiency Challenges**:
   - Existing methods such as Point-BERT and Point-MAE use the k-nearest-neighbor algorithm when embedding 3D tokens, which requires quadratic time complexity (i.e., \(O(n^2)\)) and is very inefficient for large-scale datasets.

To solve these problems, the paper proposes the Pseudo-3D Pre-training (P3P) method, which mainly includes the following innovations:

1. **Utilizing Pseudo-3D Data**:
   - It is proposed to convert 2D images into pseudo-3D data through a depth estimation model, thereby significantly increasing the amount and diversity of data available for pre-training.
   - A large number of 2D images from ImageNet-1K are mixed with limited real 3D data (such as RGB-D scans) for self-supervised pre-training.
2. **Efficient 3D Token Embedding Strategy**:
   - A 3D token embedding method with linear time complexity, Sparse Weight Indexing (SWI), is introduced, which significantly improves pre-training efficiency.
   - This method is based on voxel representation and achieves efficient embedding through discrete coordinate hashing, avoiding the high computational complexity problem in traditional methods.
3. **Efficient 2D Reconstruction Objective**:
   - A 2D reconstruction objective is designed, which simplifies 3D prediction tasks into 2D prediction tasks, greatly reducing the computational space (from cubic-level to square-level) while maintaining the performance of downstream tasks.

Through these improvements, the P3P method has achieved state-of-the-art performance in tasks such as 3D classification and few-shot learning, and has maintained high efficiency during pre-training and downstream fine-tuning.
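The idea of lifting a 2D image into pseudo-3D data can be sketched with a standard pinhole back-projection. This is a minimal illustration, not the paper's pipeline: the function name `lift_to_pseudo_3d`, the `focal` parameter, and the default principal point are assumptions, and the depth map here is a placeholder array standing in for the output of a depth estimation model.

```python
import numpy as np

def lift_to_pseudo_3d(rgb, depth, focal, cx=None, cy=None):
    """Back-project an (H, W, 3) image and an (H, W) depth map into a coloured point cloud.

    Assumes a pinhole camera with a single focal length (in pixels) and, by
    default, a principal point at the image centre. Both are illustrative choices.
    """
    h, w = depth.shape
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy

    # v is the row index, u is the column index of every pixel.
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float64)
    x = (u - cx) * z / focal
    y = (v - cy) * z / focal

    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)

    # Drop pixels without a usable (positive) depth estimate.
    valid = xyz[:, 2] > 0
    return xyz[valid], colors[valid]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
    depth = np.full((4, 6), 2.0)  # placeholder for a predicted depth map
    xyz, colors = lift_to_pseudo_3d(rgb, depth, focal=5.0)
    print(xyz.shape, colors.shape)  # (24, 3) (24, 3)
```

Because only one view is available, the resulting cloud covers just the visible surface, which is why such data is described as pseudo-3D rather than full 3D.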
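To make the contrast with k-nearest-neighbor grouping concrete, the following sketch groups points into tokens by hashing discrete voxel coordinates in a single pass and picks a projection weight by each voxel's offset inside its token. It is a hedged illustration of the general idea only, not the paper's Sparse Weight Indexing implementation: `voxel_token_embed`, `voxel_size`, `patch`, and the weight layout are all assumptions made for this example.

```python
import numpy as np

def voxel_token_embed(xyz, feats, voxel_size, patch, weights):
    """Embed points into tokens with one hash-map pass (no pairwise distances).

    xyz:     (N, 3) point coordinates
    feats:   (N, C) per-point features, e.g. colour
    weights: (patch**3, C, D) one projection matrix per voxel offset in a token
    Returns (T, 3) integer token coordinates and (T, D) token embeddings.
    """
    vox = np.floor(xyz / voxel_size).astype(np.int64)  # discrete voxel coordinates
    tok = np.floor_divide(vox, patch)                  # token each voxel falls into
    local = vox - tok * patch                          # offset inside the token, in [0, patch)
    widx = (local[:, 0] * patch + local[:, 1]) * patch + local[:, 2]

    # Hash token coordinates to consecutive ids: expected O(N) overall.
    token_ids = {}
    inverse = np.empty(len(tok), dtype=np.int64)
    for i, key in enumerate(map(tuple, tok.tolist())):
        inverse[i] = token_ids.setdefault(key, len(token_ids))
    token_coords = np.array(list(token_ids), dtype=np.int64).reshape(-1, 3)

    # Project each point with the weight selected by its offset, then sum per token.
    projected = np.einsum("nc,ncd->nd", feats, weights[widx])
    embeds = np.zeros((len(token_ids), weights.shape[-1]))
    np.add.at(embeds, inverse, projected)
    return token_coords, embeds

if __name__ == "__main__":
    xyz = np.array([[0.1, 0.1, 0.1], [0.3, 0.1, 0.1], [2.5, 0.0, 0.0]])
    feats = np.ones((3, 1))
    weights = np.ones((2 ** 3, 1, 4))
    coords, embeds = voxel_token_embed(xyz, feats, voxel_size=0.2, patch=2, weights=weights)
    print(coords.tolist())  # [[0, 0, 0], [6, 0, 0]]
    print(embeds[:, 0])     # [2. 1.]
```

Each point is touched a constant number of times, whereas a naive k-nearest-neighbor grouping compares every point with every other point, which is where the quadratic cost mentioned above comes from.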
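As a rough illustration of the saving from a 2D reconstruction target, assume a hypothetical resolution of \(r\) cells per axis. The value \(r = 128\) below is chosen only for the arithmetic and is not taken from the paper:

\[
r^3 = 128^3 = 2{,}097{,}152
\quad\text{versus}\quad
r^2 = 128^2 = 16{,}384,
\qquad
\frac{r^3}{r^2} = r = 128 .
\]

Under this assumption, a dense 3D target has 128 times as many cells to predict as a 2D target at the same resolution.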

P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Point-LGMask: Local and Global Contexts Embedding for Point Cloud Pre-training with Multi-Ratio Masking

Triple Point Masking

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception

Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Adept: Annotation-denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

Masked Autoencoder for Pre-Training on 3D Point Cloud Object Detection

Enhancing Pseudo Label Quality for Pedestrian and Cyclist in Weakly Supervised 3D Object Detection

RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection

Masked Autoencoders in 3D Point Cloud Representation Learning

Mutual Information-Driven Self-Supervised Point Cloud Pre-Training

BEV-MAE: Bird's Eye View Masked Autoencoders for Outdoor Point Cloud Pre-training