Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, Pheng-Ann Heng
2023-09-26
Abstract: Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, which neglects the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we construct Joint-MAE from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and modal-specific decoders. On top of this, we further introduce two cross-modal strategies to boost 3D representation learning: local-aligned attention mechanisms for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. With our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how 2D image information can improve self-supervised pre-training of 3D point clouds. Existing Masked Autoencoder (MAE) methods learn from a single modality, either images or point clouds, and so overlook the implicit semantic and geometric correlations between 2D and 3D data. The authors propose Joint-MAE, a 2D-3D joint MAE framework that strengthens 3D point cloud representation learning through cross-modal interaction. Its main components are:

1. **2D-3D Joint Pre-training**: Both the 3D point cloud and its projected 2D images are randomly masked, and the masked information of both modalities is reconstructed, so the two modalities are pre-trained jointly.
2. **Hierarchical 2D-3D Embedding Modules**: Two multi-level embedding modules extract features at different scales and perform the initial tokenization of the 2D and 3D data.
3. **Joint Encoder and Decoder**: A joint encoder exchanges 2D-3D semantic information through stacked Transformer blocks. A joint decoder, consisting of a modal-shared decoder followed by modal-specific decoders, simultaneously reconstructs the masked 3D point coordinates and 2D image pixels.
4. **Local-aligned Attention**: Each Transformer block of the joint encoder uses a local-aligned attention mechanism, so that only geometrically related 2D and 3D tokens attend to each other. This supplies 2D-3D semantic cues for 3D representation learning.
5. **Cross-reconstruction Loss**: In addition to the separate 2D and 3D reconstruction losses, a cross-modal reconstruction loss imposes 2D-3D geometric constraints during pre-training.
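The local-aligned attention idea above can be illustrated with a small sketch. This is not the authors' implementation: the orthographic projection onto the xy-plane, the square patch grid, the `radius` parameter, the choice to leave within-modality attention unrestricted, and the names `project_to_patch`, `local_aligned_mask`, and `masked_softmax` are all assumptions made for illustration. The sketch builds a boolean attention mask in which a 3D token and a 2D patch token may attend to each other only when the 3D token's center projects onto (or near) that patch.

```python
import math


def project_to_patch(point, grid):
    """Orthographically project a 3D point in [-1, 1]^3 onto the xy-plane
    and return the (row, col) of the image patch it lands in.
    The projection itself is an illustrative assumption."""
    x, y, _ = point
    col = min(grid - 1, max(0, int((x + 1.0) / 2.0 * grid)))
    row = min(grid - 1, max(0, int((y + 1.0) / 2.0 * grid)))
    return row, col


def local_aligned_mask(centers, grid, radius=0):
    """Build a square boolean mask over [3D tokens | 2D patch tokens].

    allowed[i][j] is True when token i may attend to token j.
    Assumption: tokens of the same modality always see each other;
    a 3D token and a 2D patch see each other only if the patch lies
    within `radius` patches of the 3D token's projected location.
    """
    n3, n2 = len(centers), grid * grid
    total = n3 + n2
    allowed = [[(i < n3) == (j < n3) for j in range(total)]
               for i in range(total)]
    for i, center in enumerate(centers):
        r0, c0 = project_to_patch(center, grid)
        for r in range(grid):
            for c in range(grid):
                if max(abs(r - r0), abs(c - c0)) <= radius:
                    j = n3 + r * grid + c
                    allowed[i][j] = True
                    allowed[j][i] = True
    return allowed


def masked_softmax(scores, allowed_row):
    """Softmax over one row of attention scores, giving zero weight to
    positions the mask disallows."""
    kept = [s for s, ok in zip(scores, allowed_row) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0
            for s, ok in zip(scores, allowed_row)]
    norm = sum(exps)
    return [e / norm for e in exps]
```

Calling `local_aligned_mask(centers, grid=4)` for a list of 3D token centers gives one mask row per token, and each row can be passed to `masked_softmax` together with that token's raw attention scores. In a real Transformer the same mask would typically be applied as an additive bias (large negative values on disallowed pairs) before the softmax.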
With these designs, Joint-MAE reports strong results on multiple downstream tasks, including 92.4% linear SVM classification accuracy on ModelNet40 and 86.07% accuracy on the most challenging split (PB-T50-RS) of ScanObjectNN. These results indicate that Joint-MAE can effectively use 2D image information to improve 3D point cloud representation learning.