BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios

Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, Ming-Hsuan Yang
DOI: https://doi.org/10.1609/aaai.v38i4.28141
2024-01-01
Abstract: Existing LiDAR-based 3D object detection methods for autonomous driving scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm relies heavily on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. In this work, we present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving. Specifically, we propose a bird's eye view (BEV) guided masking strategy that guides the 3D encoder to learn feature representations from a BEV perspective and avoids complex decoder design during pre-training. Furthermore, we introduce a learnable point token so that, for masked point cloud inputs, the receptive field size of the 3D encoder stays consistent with that used in fine-tuning. Based on a property of outdoor point clouds in autonomous driving scenarios, i.e., the point clouds of distant objects are sparser, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE surpasses prior state-of-the-art self-supervised methods and achieves favorable pre-training efficiency. Furthermore, based on TransFusion-L, BEV-MAE achieves new state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and 69.6 mAP on the nuScenes benchmark. The source code will be released at https://github.com/VDIGPKU/BEV-MAE
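
To make the BEV-guided masking and point density ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It projects points onto a BEV grid, masks a random subset of the occupied cells, and computes a per-cell density value for the masked cells. The function name `bev_guided_mask`, the grid size, the point cloud range, the mask ratio, and the density normalization (points per square meter) are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def bev_guided_mask(points, grid_size=0.8, pc_range=(-50.0, -50.0, 50.0, 50.0),
                    mask_ratio=0.7, seed=0):
    """Mask whole BEV grid cells of a point cloud (hypothetical helper).

    points: (N, 3+) array with x, y in the first two columns.
    Returns visible points, masked points, the flat ids of the masked
    cells, and a density value for each masked cell.
    """
    rng = np.random.default_rng(seed)
    x_min, y_min, x_max, y_max = pc_range

    # Keep only points that fall inside the BEV range.
    in_range = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
                (points[:, 1] >= y_min) & (points[:, 1] < y_max))
    pts = points[in_range]

    # Assign every point to a BEV cell by dropping the height axis.
    nx = int(np.ceil((x_max - x_min) / grid_size))
    ny = int(np.ceil((y_max - y_min) / grid_size))
    ix = np.clip(np.floor((pts[:, 0] - x_min) / grid_size).astype(np.int64), 0, nx - 1)
    iy = np.clip(np.floor((pts[:, 1] - y_min) / grid_size).astype(np.int64), 0, ny - 1)
    cell_id = ix * ny + iy

    # Occupied cells, the cell index of each point, and points per cell.
    occupied, inverse, counts = np.unique(
        cell_id, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)

    # Randomly choose a fraction of the occupied cells to mask.
    n_mask = int(round(mask_ratio * len(occupied)))
    cell_is_masked = np.zeros(len(occupied), dtype=bool)
    cell_is_masked[rng.choice(len(occupied), size=n_mask, replace=False)] = True
    point_is_masked = cell_is_masked[inverse]

    # Assumed density target: points per square meter in each masked cell.
    density = counts[cell_is_masked] / (grid_size ** 2)

    return pts[~point_is_masked], pts[point_is_masked], occupied[cell_is_masked], density
```

In a pre-training setup along the lines the abstract describes, the visible points would be fed to the 3D encoder, while the masked points and the density values would serve as reconstruction targets. Masking entire BEV cells, rather than individual points or voxels, is what lets the targets be defined on a 2D grid.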