PatchMixing Masked Autoencoders for 3D Point Cloud Self-Supervised Learning

Chengxing Lin, Wenju Xu, Jian Zhu, Yongwei Nie, Ruichu Cai, Xuemiao Xu
DOI: https://doi.org/10.1109/tcsvt.2024.3405069
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Point-MAE recently extended Masked Autoencoders (MAE) to point clouds for 3D self-supervised learning, but it faces two problems: (1) the masked point cloud remains highly similar in shape to the original point cloud, and (2) the pretext task of reconstructing the original point cloud is too straightforward to compel the network to learn deep, representative features. In this paper, we tackle these problems with a PatchMixing strategy and a teacher-student training framework. First, with PatchMixing, we mix selected point patches from multiple point clouds and attempt to infer the object information from the resulting mixed point cloud. The interference from other objects makes the task challenging, which in turn facilitates representation learning. Second, rather than directly restoring the original point cloud, we propose a novel pretext task that involves a two-branch teacher model and a student model. These models process the multiple input point clouds in different ways (no mixing, mixing + unmixing, mixing + masking) but are expected to output similar features, thereby compelling the network to extract essential features from the input. Extensive experiments show that the PatchMixing strategy and the teacher-student learning architecture are effective. Specifically, our model achieves 92.9% classification accuracy in the linear SVM task on the ModelNet40 dataset. After pre-training and fine-tuning on downstream tasks, our method achieves 89.8% classification accuracy on the most challenging split of ScanObjectNN and 94.0% on ModelNet40.
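The patch-mixing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how patches are selected or how many are mixed, so the uniform random selection, the `mix_ratio` parameter, and the function name `mix_patches` are all assumptions made for illustration. Each point cloud is assumed to be already grouped into the same number of patches, each patch being a list of (x, y, z) points.

```python
import random


def mix_patches(patches_a, patches_b, mix_ratio=0.5, seed=None):
    """Replace a random subset of patches in cloud A with patches from cloud B.

    Hypothetical helper: the selection rule (uniform random) and mix_ratio
    are illustrative assumptions, not details taken from the paper.

    patches_a, patches_b: lists of patches; each patch is a list of (x, y, z).
    Returns (mixed, from_b), where from_b[i] is True if patch i came from B.
    """
    if len(patches_a) != len(patches_b):
        raise ValueError("both clouds must contain the same number of patches")

    rng = random.Random(seed)
    num_patches = len(patches_a)
    num_swapped = round(mix_ratio * num_patches)

    # Pick which patch slots will be filled from cloud B.
    swapped = set(rng.sample(range(num_patches), num_swapped))

    from_b = [i in swapped for i in range(num_patches)]
    mixed = [patches_b[i] if from_b[i] else patches_a[i]
             for i in range(num_patches)]
    return mixed, from_b


# Two toy clouds of 8 patches with 2 points each.
cloud_a = [[(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)] for _ in range(8)]
cloud_b = [[(1.0, 1.0, 1.0), (1.1, 1.0, 1.0)] for _ in range(8)]
mixed, from_b = mix_patches(cloud_a, cloud_b, mix_ratio=0.5, seed=0)
```

Returning the `from_b` mask alongside the mixed cloud keeps track of each patch's origin, which is the kind of bookkeeping an unmixing step would plausibly need; how the paper actually performs unmixing is not stated in the abstract.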