Abstract: Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Bird's-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Code will be available at https://github.com/PRED4pc/PRED.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is: **how to improve the pre-training of LiDAR point cloud encoders by using image information to deal with the inherent incompleteness of point cloud data**. Specifically, the paper proposes a framework named PRED (Pre-training via Semantic Rendering), which aims to improve the performance of point cloud encoders in outdoor scenarios such as autonomous driving through semantic rendering and a high-ratio point-wise masking strategy, in an occlusion-aware manner.

### Main problem decomposition:

1. **Incompleteness of point cloud data**:
   - In outdoor LiDAR datasets, more than 30% of the labeled objects contain fewer than five points, which makes point cloud reconstruction ambiguous and lowers the quality of training.
   - For example, in the nuScenes dataset, the point cloud data of many objects is incomplete, which makes it difficult for the model to learn the features of these objects accurately.
2. **Challenges of aligning images and point clouds**:
   - Images provide more comprehensive information and richer semantics than point clouds, but directly aligning point clouds with images runs into occlusion, which can cause misalignment between points and pixels.
   - Such misalignments between LiDAR and camera data further degrade the quality of pre-training.

### Solutions:

1. **Semantic Rendering**:
   - The paper introduces a semantic rendering method conditioned on the Bird's-Eye-View (BEV) feature map, using the semantic information of images for supervision.
   - Through neural rendering, semantic predictions are generated from the BEV feature map and optimized together with a depth loss, which handles the occlusion problem.
2. **Point-wise Masking with High Mask Ratio**:
   - A point-wise masking strategy with a high mask ratio (95%) is introduced. Compared with the previous patch-level masking at a 75% ratio, this method better preserves the semantic information of the scene.
   - For smaller objects (such as pedestrians), point-wise masking avoids deleting these objects entirely, thus preserving their semantic information.

### Experimental results:

- Experiments on multiple large-scale outdoor LiDAR datasets (such as nuScenes and ONCE) verify the effectiveness of the PRED framework.
- The results show that PRED significantly outperforms existing point cloud pre-training methods on 3D object detection and BEV map segmentation tasks.

### Summary:

By proposing the PRED framework, this paper addresses both the incompleteness of point cloud data and the misalignment between images and point clouds, providing a new and effective approach to outdoor point cloud pre-training.
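
The two components described under Solutions can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`pointwise_mask`, `render_ray`), the density-based compositing rule (the standard NeRF-style formulation, swapped in here for illustration), and the plain-list inputs are all assumptions; in the paper, the per-sample values would be decoded from the BEV feature map, and the rendered semantics and depth would be compared against image-derived semantics and LiDAR depth.

```python
import math
import random


def pointwise_mask(points, mask_ratio=0.95, seed=0):
    """Randomly drop individual points (not whole patches).

    Returns (visible, masked). The 95% default follows the ratio quoted
    in the text; the sampling scheme itself is an assumption.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(points) * (1.0 - mask_ratio)))
    keep = set(rng.sample(range(len(points)), n_keep))
    visible = [p for i, p in enumerate(points) if i in keep]
    masked = [p for i, p in enumerate(points) if i not in keep]
    return visible, masked


def render_ray(densities, sample_semantics, depths):
    """Composite per-sample semantics and depth along a single ray.

    densities:        non-negative density at each sample (hypothetical
                      decoder output, assumed to come from BEV features)
    sample_semantics: per-sample class scores, one list per sample
    depths:           increasing sample distances along the ray (>= 2)
    """
    # Spacing between samples; the last interval reuses the previous one.
    deltas = [depths[i + 1] - depths[i] for i in range(len(depths) - 1)]
    deltas.append(deltas[-1])

    transmittance = 1.0  # fraction of the ray not yet absorbed
    weights = []
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha

    n_classes = len(sample_semantics[0])
    semantics = [
        sum(w * s[c] for w, s in zip(weights, sample_semantics))
        for c in range(n_classes)
    ]
    depth = sum(w * d for w, d in zip(weights, depths))
    return semantics, depth, weights


# Usage: keep 5% of a toy point cloud, then render one toy ray whose
# first opaque sample sits at depth 3.0 and belongs to class 1.
points = [(float(i), 0.0, 0.0) for i in range(1000)]
visible, masked = pointwise_mask(points)
semantics, depth, weights = render_ray(
    densities=[0.0, 0.0, 50.0, 50.0],
    sample_semantics=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    depths=[1.0, 2.0, 3.0, 4.0],
)
```

Because each weight is scaled by the remaining transmittance, samples behind the first opaque surface contribute almost nothing to the rendered semantics. This is one way a rendering-based objective can avoid supervising occluded points with pixels that depict a nearer surface.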

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds
