Point Cloud Reconstruction is Insufficient to Learn 3D Representations

Weichen Xu, Jian Cao, Tianhao Fu, Ruilong Ren, Zicong Hu, Xixin Cao, Xing Zhang
DOI: https://doi.org/10.1145/3664647.3680890
2024-01-01
Abstract: This paper revisits the development of generative self-supervised learning for 2D images and 3D point clouds in autonomous driving. In 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, we find through model analysis that the gap in weight distribution between self-supervised learning and supervised learning is substantial when only low-level features are used as the pretext task for 3D point clouds. Low-level features represented by PoInt Cloud reconsTruction are insUfficient to learn 3D REpresentations (dubbed PICTURE). To advance the development of pretext tasks, we propose a unified generative self-supervised framework. First, we demonstrate that high-level features exhibit semantic consistency with downstream tasks, and we use them as an additional pretext task to enhance the understanding of semantic information during pre-training. Next, we propose inter-class and intra-class discrimination-guided masking (I2Mask), which is based on the attributes of the high-level features and adaptively sets the masking ratio for each superclass. On the Waymo and nuScenes datasets, we achieve 75.13% mAP and 72.69% mAPH for 3D object detection, 79.4% mIoU for 3D semantic segmentation, and 18.4% mIoU for occupancy prediction. Extensive experiments demonstrate the effectiveness and necessity of high-level features.
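The idea of a per-superclass masking ratio can be illustrated with a minimal sketch. This is not the paper's I2Mask implementation: the abstract does not say how the ratios are derived from the high-level features, so the sketch simply takes them as given. The function name `mask_per_superclass`, the superclass names, and the ratio values are all hypothetical.

```python
import random
from collections import defaultdict


def mask_per_superclass(labels, ratios, seed=0):
    """Return a list of booleans, True where an element is masked.

    labels: superclass id for each element (e.g. per voxel or point);
            how these ids are obtained is outside this sketch.
    ratios: dict mapping superclass id -> masking ratio in [0, 1];
            assumed to be given, not computed as in the paper.
    """
    rng = random.Random(seed)

    # Group element indices by their superclass.
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    masked = [False] * len(labels)
    for label, indices in groups.items():
        # Number of elements to hide within this superclass.
        n_mask = round(ratios[label] * len(indices))
        for idx in rng.sample(indices, n_mask):
            masked[idx] = True
    return masked


# Hypothetical usage: mask a quarter of "vehicle" elements
# and three quarters of "background" elements.
labels = ["vehicle"] * 4 + ["background"] * 8
ratios = {"vehicle": 0.25, "background": 0.75}
mask = mask_per_superclass(labels, ratios)
```

Sampling within each group, rather than over the whole scene, is what lets each superclass keep its own ratio regardless of how many elements it contains.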