Abstract: Masked autoencoders have been widely explored in point cloud self-supervised learning, where the point cloud is generally divided into visible and masked parts. These methods typically include an encoder that accepts visible patches (normalized) and their corresponding patch centers (positions) as input, and a decoder that accepts the encoder output and the centers (positions) of the masked parts to reconstruct each point in the masked patches. The pre-trained encoder is then used for downstream tasks. In this paper, we show a motivating empirical result: when the centers of masked patches are fed directly to the decoder without any information from the encoder, the decoder still reconstructs well. In other words, the patch centers are important, and the reconstruction objective does not necessarily rely on the encoder's representations, which prevents the encoder from learning semantic representations. Based on this key observation, we propose a simple yet effective method, learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE), which guides the model to predict the significant centers and uses the predicted centers in place of the directly provided ones. Specifically, we propose a Predicting Center Module (PCM) that shares parameters with the original encoder and adds cross-attention to predict centers. Our method achieves high pre-training efficiency compared to alternatives and improves substantially over Point-MAE, outperforming it by 5.50%, 6.03%, and 5.17% on three variants of ScanObjectNN. The code will be made publicly available.
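
The visible/masked split described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `split_patches`, the patch count, patch size, and mask ratio are all hypothetical, and centers are drawn uniformly at random here as a simplification (the abstract does not specify how centers are chosen). The sketch only shows which tensors an encoder (visible patches plus visible centers) and a decoder (masked centers) would receive under these assumptions.

```python
import numpy as np

def split_patches(points, num_patches=8, patch_size=4, mask_ratio=0.6, seed=0):
    """Group a point cloud into patches and split them into visible/masked sets.

    All default values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Assumption: pick patch centers uniformly at random from the cloud.
    center_idx = rng.choice(len(points), size=num_patches, replace=False)
    centers = points[center_idx]                                  # (G, 3)

    # Gather the patch_size nearest neighbours of each center.
    dists = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)  # (G, N)
    knn_idx = np.argsort(dists, axis=1)[:, :patch_size]           # (G, K)

    # "Normalized" patches: coordinates expressed relative to the patch center.
    patches = points[knn_idx] - centers[:, None, :]               # (G, K, 3)

    # Randomly choose which patches are masked.
    num_masked = int(round(mask_ratio * num_patches))
    perm = rng.permutation(num_patches)
    masked, visible = perm[:num_masked], perm[num_masked:]

    return {
        "visible_patches": patches[visible],   # encoder input
        "visible_centers": centers[visible],   # encoder positional input
        "masked_patches": patches[masked],     # reconstruction target
        "masked_centers": centers[masked],     # given to (or predicted for) the decoder
    }
```

Because the masked centers are absolute positions while the patches are center-relative, the centers alone already carry the coarse shape of the masked region, which is consistent with the abstract's observation that a decoder given only masked centers can still reconstruct well.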

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

A-JEPA: Joint-Embedding Predictive Architecture Can Listen

The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders

Denoising with a Joint-Embedding Predictive Architecture

Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling

Graph-level Representation Learning with Joint-Embedding Predictive Architectures

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention

Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture

3DMAE: Joint SAR and Optical Representation Learning with Vertical Masking

SemDM: Task-oriented masking strategy for self-supervised visual learning

How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks