LiDAR-Camera Continuous Fusion in Voxelized Grid for Semantic Scene Completion

Zonghao Lu, Bing Cao, Qinghua Hu
DOI: https://doi.org/10.1109/tcsvt.2024.3435045
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Semantic Scene Completion (SSC) requires a comprehensive perception of both the geometry and semantics across the entire 3D scene. In the domain of autonomous driving, the majority of existing SSC methods rely on single-modal images (e.g., MonoScene, TPVFormer) or point clouds (e.g., S3CNet, JS3C-Net), without taking into account the complementary information from bimodal sources. In this work, we propose an Image and Point Cloud continuous fusion in Voxel Network (IPVoxelNet) to address SSC within the voxelized space. IPVoxelNet represents images and point clouds within a unified voxelized space and utilizes the Image and Point Cloud Fusion (IPF) layers for continuous fusion of bimodal features. Specifically, IPVoxelNet utilizes pixel-to-voxel reprojection to map pixels into 3D space, leveraging the dense semantics of images. Unordered point clouds are represented in voxel space through regularization. IPVoxelNet independently learns the geometry and semantics of each modality. Additionally, we propose cross-modal knowledge distillation to transfer geometric information from point clouds to images. We validate our model on the challenging SemanticKITTI and nuScenes-Occupancy datasets, achieving state-of-the-art results across multiple classes. IPVoxelNet demonstrates competitive performance in both geometry (SC IoU) and semantics (mIoU).
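The abstract states that unordered point clouds are represented in voxel space through regularization. As a rough illustration of that general idea only, the sketch below bins 3D points into a regular grid and counts the points per voxel. It is not the authors' implementation: the function name `voxelize_points`, its parameters, and the grid values in the usage note are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def voxelize_points(points, grid_min, voxel_size, grid_shape):
    """Bin 3D points into a regular voxel grid (hypothetical helper).

    points     : iterable of (x, y, z) coordinates
    grid_min   : (x, y, z) of the grid's minimum corner
    voxel_size : edge length of one cubic voxel, same unit as the points
    grid_shape : number of voxels along x, y, z

    Returns a Counter mapping each occupied voxel index (i, j, k) to the
    number of points that fall inside it. Points outside the grid are dropped.
    """
    counts = Counter()
    for point in points:
        # Integer voxel index along each axis.
        index = tuple(
            math.floor((coord - lo) / voxel_size)
            for coord, lo in zip(point, grid_min)
        )
        # Keep the point only if its index lies inside the grid bounds.
        if all(0 <= i < n for i, n in zip(index, grid_shape)):
            counts[index] += 1
    return counts
```

For example, with a hypothetical 5 x 5 x 5 grid of 0.2-unit voxels anchored at the origin, two nearby points fall into the same voxel and a point far outside the grid is discarded. The input order of the points does not affect the result, which is what makes the grid a regular representation of an unordered set.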