Abstract: Manual annotation of large-scale point clouds is time-consuming and is often unavailable in harsh real-world scenarios. Inspired by the success of the pre-training and fine-tuning paradigm in both vision and language tasks, we argue that pre-training is one potential solution for obtaining a scalable model for 3D point cloud downstream tasks as well. In this paper, we therefore explore a new self-supervised learning method, called Mixing and Disentangling (MD), for 3D point cloud representation learning. As the name implies, we mix two input shapes and require the model to learn to separate the inputs from the mixed shape. We leverage this reconstruction task as the pretext optimization objective for self-supervised learning. There are two primary advantages: 1) Compared to prevailing image datasets, e.g., ImageNet, point cloud datasets are de facto small, and the mixing process can provide a much larger online training sample pool. 2) The disentangling process motivates the model to mine geometric prior knowledge, e.g., key points. To verify the effectiveness of the proposed pretext task, we build one baseline network, which is composed of one encoder and one decoder. During pre-training, we mix two original shapes and obtain the geometry-aware embedding from the encoder; an instance-adaptive decoder is then applied to recover the original shapes from the embedding. Albeit simple, the pre-trained encoder can capture the key points of an unseen point cloud and surpasses the encoder trained from scratch on downstream tasks. The proposed method improves empirical performance on the ModelNet-40 and ShapeNet-Part datasets for point cloud classification and segmentation tasks. We further conduct ablation studies to explore the effect of each component and verify the generalization of our proposed strategy by applying it to different backbones.
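The mixing step and the disentangling objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact mixing rule or loss, so the sketch assumes that half of the points are sampled from each input shape and that reconstruction quality is measured with a symmetric Chamfer distance. The names `mix_shapes`, `chamfer_distance`, `disentangle_loss`, and `random_cloud` are hypothetical and are not taken from the paper. Plain Python lists are used only to keep the example dependency-free.

```python
import random


def mix_shapes(shape_a, shape_b, rng):
    """Mix two N-point clouds into a single N-point cloud.

    Assumption: half of the points come from each input; the abstract
    does not specify the actual mixing rule.
    """
    n = len(shape_a)
    assert len(shape_b) == n, "both shapes must have the same number of points"
    half = n // 2
    mixed = rng.sample(shape_a, half) + rng.sample(shape_b, n - half)
    rng.shuffle(mixed)  # hide which input each point came from
    return mixed


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance using mean squared nearest-neighbour distance."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


def disentangle_loss(pred_a, pred_b, shape_a, shape_b):
    """Assumed pretext loss: each recovered shape should match its original."""
    return chamfer_distance(pred_a, shape_a) + chamfer_distance(pred_b, shape_b)


def random_cloud(n, offset, rng):
    """Toy stand-in for a real shape: n random points shifted along x."""
    return [(rng.random() + offset, rng.random(), rng.random()) for _ in range(n)]


rng = random.Random(0)
shape_a = random_cloud(64, 0.0, rng)  # x in [0, 1)
shape_b = random_cloud(64, 5.0, rng)  # x in [5, 6)
mixed = mix_shapes(shape_a, shape_b, rng)

# A perfect decoder would recover both originals and reach zero loss.
perfect = disentangle_loss(shape_a, shape_b, shape_a, shape_b)
# Returning the two shapes in swapped order is penalised.
swapped = disentangle_loss(shape_b, shape_a, shape_a, shape_b)
```

In an actual pipeline the encoder would map `mixed` to an embedding and the decoder would produce `pred_a` and `pred_b`. Because a fresh mixed sample can be drawn for every pair of shapes at every iteration, the pool of training inputs grows far beyond the number of original shapes, which is the first advantage the abstract names.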

Point Cloud Reconstruction is Insufficient to Learn 3D Representations

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Mutual Information-Driven Self-Supervised Point Cloud Pre-Training

PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation

Self-supervised Point Cloud Representation Learning Via Separating Mixed Shapes

SegContrast: 3D Point Cloud Feature Representation Learning Through Self-Supervised Segment Discrimination

Self-Supervised Point Cloud Representation Learning with Occlusion Auto-Encoder

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting

Point-AGM: Attention Guided Masked Auto-Encoder for Joint Self-supervised Learning on Point Clouds

GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training

Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey

MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding

PointMoment: Mixed-Moment-based Self-Supervised Representation Learning for 3D Point Clouds

Contrastive Predictive Autoencoders for Dynamic Point Cloud Self-Supervised Learning

Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning

Self-Supervised Deep Learning on Point Clouds by Reconstructing Space

PointUR-RL: Unified Self-Supervised Learning Method Based on Variable Masked Autoencoder for Point Cloud Reconstruction and Representation Learning

Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast

3D-OAE: Occlusion Auto-Encoders for Self-Supervised Learning on Point Clouds