Abstract: Self-supervised learning on point clouds aims to leverage unlabeled 3D data to learn meaningful representations without relying on manual annotations. However, current approaches face challenges such as limited data diversity and augmentation that is inadequate for effective feature learning. To address these challenges, we propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time. Our pipeline uses transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS. Specifically, the transformers aim to reconstruct the masked point cloud, while 3DGS takes multi-view rendered images as input and generates enhanced point cloud distributions and novel view images, facilitating data augmentation and cross-modal contrastive learning. Additionally, we incorporate features from depth maps. By optimizing these tasks jointly, our method enriches the tri-modal self-supervised learning process, enabling the model to leverage the correlation between 3D point clouds and 2D images from various modalities. We freeze the encoder after pre-training and test the model's performance on multiple downstream tasks. Experimental results indicate that GS-PT outperforms off-the-shelf self-supervised learning methods on various downstream tasks, including 3D object classification, real-world classification, few-shot learning, and segmentation.

What problem does this paper attempt to address?

The paper asks how to learn meaningful representations from large-scale unlabeled 3D point cloud data through self-supervised learning, without relying on manual annotations. It identifies two main challenges for current self-supervised methods on 3D point clouds:

1. **Data diversity and the scarcity of high-quality multimodal data pairs**: Effective self-supervised learning benefits from integrating information from several sources (such as point clouds, rendered RGB images, and depth maps), but high-quality pairs of this kind are scarce in practice.
2. **Simple geometric transformations lead to single-feature representations**: Existing methods typically rely on simple geometric transformations for data augmentation, which yields overly simplistic feature representations and hurts the model's generalization ability.

To address these challenges, the paper proposes GS-PT (Gaussian Splatting for Point Cloud Self-Supervised Learning), which applies 3D Gaussian Splatting (3DGS) to point cloud self-supervised learning for the first time. GS-PT feeds multi-view rendered images to 3DGS to generate enhanced point cloud distributions and novel view images, which provides richer data augmentation and enables cross-modal contrastive learning. It also incorporates depth map features and optimizes these tasks together, enriching the tri-modal self-supervised learning process and helping the model exploit the correlations between 3D point clouds and 2D images.

Experimental results show that GS-PT outperforms existing self-supervised learning methods on multiple downstream tasks (such as 3D object classification, real-world classification, few-shot learning, and segmentation).
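The cross-modal contrastive idea described above can be illustrated with a small sketch. The summary does not state which loss GS-PT uses, so the symmetric InfoNCE objective below is an assumption, as are the function names `info_nce` and `tri_modal_loss`, the temperature value, and the choice to contrast point cloud features against image and depth features. It only shows how paired embeddings from different modalities can be pulled together while unpaired ones are pushed apart.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss for two batches of paired embeddings.

    z_a, z_b: arrays of shape (batch, dim); row i of z_a is the positive
    match for row i of z_b, and every other row acts as a negative.
    NOTE: illustrative assumption, not the loss confirmed by the paper.
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


def tri_modal_loss(z_point, z_image, z_depth, temperature=0.07):
    """Hypothetical tri-modal objective: point cloud features are
    contrasted against image features and against depth features."""
    return (info_nce(z_point, z_image, temperature)
            + info_nce(z_point, z_depth, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    print("aligned pairs: ", info_nce(z, z))
    print("shuffled pairs:", info_nce(z, np.roll(z, 1, axis=0)))
```

Running the script compares correctly paired embeddings with deliberately mismatched ones; the aligned case should give the smaller loss. In a real pipeline the three inputs would come from the point cloud, image, and depth encoders rather than from random arrays.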

GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning
