Abstract: Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, which neglects the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we build Joint-MAE from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and modal-specific decoders. On top of this, we further introduce two cross-modal strategies to boost 3D representation learning: a local-aligned attention mechanism for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. With our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.

What problem does this paper attempt to address?

The paper asks how 3D point cloud pre-training can be improved by combining 2D image and 3D point cloud information within a self-supervised learning framework. Existing Masked Autoencoder (MAE) methods handle only single-modal data, either images or point clouds, and so overlook the implicit semantic and geometric associations between 2D and 3D data. The authors propose a 2D-3D joint MAE framework named Joint-MAE, which aims to strengthen 3D point cloud representation learning through cross-modal interaction. Specifically, Joint-MAE addresses the problem through the following designs:
1. **2D-3D Joint Pre-training**: Both the 3D point cloud and its projected 2D images are randomly masked, and the masked information of both modalities is reconstructed, so the two modalities are pre-trained jointly.
2. **Multi-level 2D-3D Embedding Modules**: Two multi-level 2D-3D embedding modules extract features at different scales and perform the initial tokenization of the 2D and 3D data.
3. **Joint Encoder and Decoder**: A joint encoder exchanges 2D-3D semantic information through stacked Transformer blocks. A joint decoder, consisting of a modality-shared decoder and modality-specific decoders, simultaneously reconstructs the masked 3D point cloud coordinates and 2D image pixels.
4. **Local Alignment Attention Mechanism**: Each Transformer block of the joint encoder uses a local alignment attention mechanism, so that only geometrically related 2D-3D tokens take part in the attention computation. This strengthens the representation learning of 3D point clouds.
5. **Cross-modal Reconstruction Loss**: In addition to the independent 2D and 3D reconstruction losses, a cross-modal reconstruction loss provides 2D-3D geometric constraints, further improving pre-training.
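The local alignment idea in item 4 can be illustrated with a small sketch. It is not the authors' implementation: the orthographic projection onto the xy-plane, the square patch grid, the `radius` neighborhood, and all function names are assumptions chosen only to show how an attention mask could restrict each 3D token to the 2D patches near its projection.

```python
import math

def project_to_patch(center, grid):
    """Hypothetical orthographic projection of a 3D token center (coordinates
    in [-1, 1]) onto the xy-plane, returned as (row, col) in a grid x grid layout."""
    x, y, _ = center
    col = min(grid - 1, max(0, int((x + 1.0) / 2.0 * grid)))
    row = min(grid - 1, max(0, int((y + 1.0) / 2.0 * grid)))
    return row, col

def local_alignment_mask(centers, grid, radius=1):
    """mask[i][j] is True when 2D patch j lies within `radius` patches
    (Chebyshev distance) of the projection of 3D token i."""
    mask = []
    for center in centers:
        r0, c0 = project_to_patch(center, grid)
        mask.append([
            max(abs(j // grid - r0), abs(j % grid - c0)) <= radius
            for j in range(grid * grid)
        ])
    return mask

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; all other positions get weight 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Two 3D token centers attending over a 4 x 4 grid of 2D patch tokens.
centers = [(-0.9, -0.9, 0.2), (0.0, 0.0, -0.5)]
mask = local_alignment_mask(centers, grid=4, radius=1)

# Toy attention scores of the first 3D token against all 16 patches.
weights = masked_softmax([0.1 * j for j in range(16)], mask[0])
```

A token projected into a corner of the grid sees fewer patches than one projected into the interior, so the attention weights are renormalized over whatever neighborhood survives the mask.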
With these designs, Joint-MAE achieves strong results on multiple downstream tasks, including a linear SVM classification accuracy of 92.4% on the ModelNet40 dataset and 86.07% accuracy on the most challenging split (PB-T50-RS) of the ScanObjectNN dataset. These results indicate that Joint-MAE can effectively use 2D image information to strengthen the representation learning of 3D point clouds.
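The random masking step and the cross-modal reconstruction loss described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's code: the mask ratio, the orthographic depth projection, the mean-squared-error comparison, and every name below are hypothetical stand-ins for whatever the authors actually use.

```python
import random

def random_mask(num_tokens, ratio, rng):
    """Split token indices into (visible, masked) with the given mask ratio."""
    num_masked = int(num_tokens * ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])

def project_depth(points, grid):
    """Hypothetical orthographic depth map along z. Coordinates are in [-1, 1];
    each cell keeps the largest depth rescaled to [0, 1], empty cells stay 0.0."""
    depth = [[0.0] * grid for _ in range(grid)]
    for x, y, z in points:
        col = min(grid - 1, max(0, int((x + 1.0) / 2.0 * grid)))
        row = min(grid - 1, max(0, int((y + 1.0) / 2.0 * grid)))
        depth[row][col] = max(depth[row][col], (z + 1.0) / 2.0)
    return depth

def mse(a, b):
    """Mean squared error between two equally sized 2D grids."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((p - q) ** 2 for p, q in zip(flat_a, flat_b)) / len(flat_a)

# Mask 75% of 8 tokens (an illustrative ratio, not taken from the paper).
rng = random.Random(0)
visible, masked = random_mask(num_tokens=8, ratio=0.75, rng=rng)

# Cross-modal term: project the reconstructed 3D points to a depth map and
# compare it against a 2D target (toy values, assumed for illustration).
reconstructed_points = [(-0.5, -0.5, 0.0), (0.5, 0.5, 1.0)]
target_depth = [[0.5, 0.0], [0.0, 0.5]]
cross_loss = mse(project_depth(reconstructed_points, grid=2), target_depth)
```

In a full training loop this cross-modal term would be added to the independent 2D and 3D reconstruction losses; how the terms are weighted is not stated in the summary above.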

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Masked Autoencoders for Point Cloud Self-supervised Learning

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Masked Autoencoders in 3D Point Cloud Representation Learning

Inter-Modal Masked Autoencoder for Self-Supervised Learning on Point Clouds

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

LR-MAE: Locate While Reconstructing with Masked Autoencoders for Point Cloud Self-supervised Learning

UniM^2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders

CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection

GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training

Multi-Angle Point Cloud-VAE: Unsupervised Feature Learning for 3D Point Clouds from Multiple Angles by Joint Self-Reconstruction and Half-to-Half Prediction

Mapping medical image-text to a joint space via masked modeling

BEV-MAE: Bird's Eye View Masked Autoencoders for Outdoor Point Cloud Pre-training

MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders