LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation

Song Wang, Wentong Li, Wenyu Liu, Xiaolu Liu, Jianke Zhu
2023-06-05
Abstract: Semantic map construction under bird's-eye view (BEV) plays an essential role in autonomous driving. In contrast to camera images, LiDAR provides accurate 3D observations, so the captured 3D features can be projected onto BEV space inherently. However, the vanilla LiDAR-based BEV feature often contains considerable indefinite noise, and its spatial features carry little texture and few semantic cues. In this paper, we propose an effective LiDAR-based method to build semantic maps. Specifically, we introduce a BEV feature pyramid decoder that learns robust multi-scale BEV features for semantic map construction, which greatly boosts the accuracy of the LiDAR-based method. To mitigate the defects caused by the lack of semantic cues in LiDAR data, we present an online Camera-to-LiDAR distillation scheme to facilitate semantic learning from image to point cloud. Our distillation scheme consists of feature-level and logit-level distillation to absorb the semantic information from the camera in BEV. The experimental results on the challenging nuScenes dataset demonstrate the efficacy of our proposed LiDAR2Map on semantic map construction: it significantly outperforms the previous LiDAR-based methods by over 27.9% mIoU and even performs better than the state-of-the-art camera-based approaches. Source code is available at https://github.com/songw-zju/LiDAR2Map
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses how to construct semantic maps effectively for autonomous driving. Specifically, the authors propose a LiDAR-based method that builds semantic maps in bird's-eye view (BEV) and overcomes the limitations of existing approaches.

### Background Issues

1. **Limitations of Camera Images**:
   - Camera images provide rich texture and semantic information, but they suffer from spatial distortion when used to construct semantic maps. They also rely on high-resolution images and large pre-trained models, which makes practical deployment difficult.
2. **Limitations of LiDAR Data**:
   - LiDAR provides accurate 3D spatial information, but the BEV features it generates often contain considerable indefinite noise and lack texture and semantic cues.

### Proposed Method

To overcome these issues, the authors propose **LiDAR2Map**, which has two main components:

1. **BEV Feature Pyramid Decoder (BEV-FPD)**:
   - An efficient decoder learns robust multi-scale BEV feature representations from the precise spatial information of LiDAR point clouds. This improves the accuracy of the baseline model.
2. **Online Camera-to-LiDAR Distillation Scheme**:
   - An online distillation scheme transfers semantic information from images to LiDAR data through feature-level and logit-level distillation. It includes:
     - **Position-Guided Feature Fusion Module (PGF2M)**: Fuses the camera and LiDAR features in BEV space.
     - **Feature-Level Distillation (FD)**: Generates a global affinity map through a tree filter to perform feature-level distillation.
     - **Logit-Level Distillation (LD)**: Measures the similarity of probability distributions with KL divergence, so that the LiDAR branch learns soft labels from the camera-LiDAR fusion model.
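
The logit-level distillation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the teacher is the camera-LiDAR fusion branch, that the student is the LiDAR branch, that the loss is KL(teacher || student) averaged over BEV cells, and that an optional temperature is applied. All function names and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one cell's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-cell KL(teacher || student) over a flattened BEV grid.

    Each argument is a list of per-cell class-logit lists. The KL direction
    and the temperature are assumptions made for this sketch.
    """
    losses = []
    for t_cell, s_cell in zip(teacher_logits, student_logits):
        soft_labels = softmax(t_cell, temperature)   # teacher soft labels
        prediction = softmax(s_cell, temperature)    # student distribution
        losses.append(kl_divergence(soft_labels, prediction))
    return sum(losses) / len(losses)


# Toy example: two BEV cells, three classes (hypothetical values).
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.5]]
loss = logit_distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what lets the LiDAR branch learn from soft labels rather than only from hard ground-truth masks.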
### Experimental Results

- Experimental results on the nuScenes dataset show that LiDAR2Map significantly outperforms existing LiDAR-based methods on semantic map construction, with mIoU improving from 29.5% to 57.4%.
- On the vehicle segmentation task, LiDAR2Map also surpasses existing camera-based methods in accuracy while offering advantages in model parameters and inference speed.

### Main Contributions

1. An efficient framework, LiDAR2Map, is proposed, in which the BEV Feature Pyramid Decoder learns robust BEV feature representations and improves the performance of the baseline model.
2. An effective online camera-to-LiDAR distillation scheme is introduced, performing feature-level and logit-level distillation during training to fully absorb the semantic representations from images.
3. Extensive experiments are conducted on the nuScenes dataset, covering map and vehicle segmentation tasks, and demonstrate the superior performance of the proposed method.

In summary, this paper proposes an efficient and accurate semantic map construction method that combines the precise spatial information of LiDAR with the rich semantic information of cameras.
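
For readers unfamiliar with the mIoU metric quoted in the results, the following minimal sketch shows how it is commonly computed. It assumes each BEV cell carries a single integer class label and that classes absent from both prediction and ground truth are skipped. The paper's exact evaluation protocol may differ, and the function name and toy labels are hypothetical. The improvement from 29.5% to 57.4% reported above corresponds to the 27.9-point gain cited in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flattened label lists."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six BEV cells, three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` averages to 0.5.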