Abstract: Learning universal representations from unlabeled 3D point clouds is essential to improve the generalization and safety of autonomous driving. Generative self-supervised point cloud pre-training with low-level features as pretext tasks is a mainstream paradigm. However, from the perspective of mutual information, this approach is constrained by spatial information and entangled representations. In this study, we propose a generalized generative self-supervised point cloud pre-training framework called GPICTURE. High-level features were used as an additional pretext task to enhance the understanding of semantic information. Because voxel features differ in how discriminable they are, and therefore in how difficult they are to reconstruct, we designed inter-class and intra-class discrimination-guided masking (I2Mask) to set the masking ratio adaptively. Furthermore, to ensure a hierarchical and stable reconstruction process, centered kernel alignment-guided hierarchical reconstruction and differential-gated progressive learning were employed to control the multiple reconstruction tasks. Theoretical analyses demonstrated that using high-level features as a pretext task can enhance the mutual information between the latent features and the high-level features, as well as between the latent features and the input point cloud. On Waymo, nuScenes, and SemanticKITTI, we achieved a 75.55% mAP for 3D object detection, 79.7% mIoU for 3D semantic segmentation, and 18.8% mIoU for occupancy prediction. Specifically, with only 50% of the fine-tuning data, the performance of GPICTURE was close to that of training from scratch with 100% of the fine-tuning data. In addition, visualizations consistent with the downstream tasks and a 57% reduction in weight disparity demonstrated a better fine-tuning starting point. The project page is hosted at https://gpicture-page.github.io/.
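The abstract names centered kernel alignment (CKA) as the signal that guides hierarchical reconstruction but does not say how it is computed or applied. As background only, the following is a minimal sketch of the standard linear CKA similarity between two feature matrices. It is not GPICTURE's implementation; the function name `linear_cka`, the NumPy formulation, and the random example inputs are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices (hypothetical helper).

    x has shape (n_samples, d1) and y has shape (n_samples, d2);
    row i of x and row i of y must describe the same sample.
    """
    # Center every feature dimension so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_level = rng.normal(size=(64, 8))    # stand-in for low-level features
    high_level = rng.normal(size=(64, 5))   # stand-in for high-level features
    print(linear_cka(low_level, low_level))   # identical features score ~1
    print(linear_cka(low_level, high_level))  # unrelated features score lower
```

Linear CKA lies in [0, 1] and is invariant to orthogonal transformations and isotropic scaling of either input, which is why it is commonly used to compare representations of different widths. How the paper turns such a score into a reconstruction schedule is not stated in the abstract.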
