Abstract: Contrastive learning (CL) for Vision Transformers (ViTs) in the image domain has achieved performance comparable to CL for traditional convolutional backbones. In 3D point cloud pre-training with ViTs, however, masked autoencoder (MAE) modeling remains dominant. This raises the question: can we take the best of both worlds? To answer it, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can degrade performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on the extensive data augmentation commonly used in the image domain, we randomly mask the input tokens twice to generate contrastive input pairs. A weight-sharing encoder and two identically structured decoders then perform masked token reconstruction. Additionally, we propose that for an input token masked by both masks simultaneously, the two reconstructed features should be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE improves representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate that our framework surpasses state-of-the-art techniques under standard ViT and single-modal settings. The source code and trained models are available at: https://github.com/Amazingren/Point-CMAE
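The dual-masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mask ratio, and the use of a cosine-distance penalty are all assumptions made for illustration (the abstract only states that features reconstructed for tokens hidden by both masks "should be as similar as possible"). The two feature lists stand in for the outputs of the two decoders.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a masked (hidden) token."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def co_masked_indices(mask_a, mask_b):
    """Indices of tokens that are hidden by both masks at once."""
    return [i for i, (a, b) in enumerate(zip(mask_a, mask_b)) if a and b]


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)


def co_mask_consistency_loss(feats_a, feats_b, mask_a, mask_b):
    """Mean (1 - cosine similarity) over co-masked tokens.

    feats_a / feats_b are hypothetical per-token features reconstructed by
    the two decoders. The cosine form is an assumed choice of similarity.
    """
    idx = co_masked_indices(mask_a, mask_b)
    if not idx:
        return 0.0  # no token is hidden in both views
    total = sum(1.0 - cosine_similarity(feats_a[i], feats_b[i]) for i in idx)
    return total / len(idx)


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, dim = 64, 8
    # Mask the same token sequence twice to obtain the two views.
    mask_a = random_mask(num_tokens, 0.6, rng)
    mask_b = random_mask(num_tokens, 0.6, rng)
    # Stand-ins for the two decoders' reconstructed features.
    feats_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    feats_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    print(len(co_masked_indices(mask_a, mask_b)), "co-masked tokens")
    print("consistency loss:", co_mask_consistency_loss(feats_a, feats_b, mask_a, mask_b))
```

In a real pre-training loop this term would presumably be added to the usual reconstruction loss; how the two are weighted is not stated in the abstract.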

CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection

Masked Autoencoders for Point Cloud Self-supervised Learning.

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders

LR-MAE: Locate While Reconstructing with Masked Autoencoders for Point Cloud Self-supervised Learning

Masked Autoencoders in 3D Point Cloud Representation Learning

Masked Autoencoder for Pre-Training on 3D Point Cloud Object Detection

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

3DMAE: Joint SAR and Optical Representation Learning with Vertical Masking.

BEV-MAE: Bird's Eye View Masked Autoencoders for Outdoor Point Cloud Pre-training

Contrastive Masked Autoencoders are Stronger Vision Learners

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios

GeoMask3D: Geometrically Informed Mask Selection for Self-Supervised Point Cloud Learning in 3D

Exploring Geometry-aware Contrast and Clustering Harmonization for Self-supervised 3D Object Detection.

Inter-Modal Masked Autoencoder for Self-Supervised Learning on Point Clouds