OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

Ji Zhang, Yiran Ding, Zixin Liu
2024-07-10
Abstract: 3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, which hinders the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth-estimation-free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance its performance. Experiments conducted on nuScenes-Occupancy and nuScenes-Occ3D demonstrate our framework's superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the following key issues:

1. **Challenges of Depth Estimation**: Traditional 3D occupancy prediction methods based on multi-sensor fusion rely on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, and its robustness and accuracy are difficult to guarantee, which affects the overall performance of these methods.
2. **Demand for Computational Resources**: Fine-grained 3D occupancy prediction requires a large amount of computational resources, which is a significant bottleneck in practical applications.
3. **Effective Fusion of Multimodal Data**: Fusing 2D image features with 3D LiDAR features effectively, without relying on depth estimation, remains an open problem.

To address these challenges, the paper proposes OccFusion, a multimodal fusion framework that does not require depth estimation. The main contributions of this framework are:

- **Point-to-Point Multimodal Feature Fusion**: The LiDAR point cloud is preprocessed into a denser and more evenly distributed point cloud, and 2D image features are then fused directly with 3D LiDAR features point-to-point, which avoids the instability of depth estimation.
- **Efficient Point Cloud Preprocessing Algorithm**: Generating synthetic points and applying farthest point sampling makes the density and distribution of points within each voxel more uniform, thereby improving the effectiveness of feature fusion.
- **Active Decoder and Training Method**: An active decoder selectively refines the predictions of voxels with high uncertainty, significantly reducing the complexity of the model. In addition, an active training method lets the model prioritize learning from more challenging samples, further enhancing its performance.
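The farthest point sampling step mentioned above can be illustrated with a minimal sketch. This is the generic textbook algorithm in plain Python, not the paper's preprocessing pipeline: the function name `farthest_point_sampling`, the `start` parameter, and the toy point cloud are assumptions made for illustration, and the synthetic point generation and per-voxel handling described in the summary are not reproduced here.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Pick k indices from `points` (a list of (x, y, z) tuples) so that each
    newly chosen point is the one farthest from all points chosen so far.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point that is currently worst covered.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-selected distances with the new sample.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy cloud: a dense cluster near the origin plus two isolated points.
cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (5.0, 0.0, 0.0),
    (0.0, 5.0, 0.0),
]
sampled = farthest_point_sampling(cloud, 3)
```

On this toy cloud the sampler keeps one point from the dense cluster and both isolated points, which is the evening-out effect the summary attributes to the preprocessing step.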
Experimental results on the nuScenes-Occupancy and nuScenes-Occ3D datasets show that, with these innovations, the OccFusion framework outperforms existing multimodal baseline methods and performs particularly well on small-object categories.
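The idea behind the active decoder, refining only voxels whose prediction is highly uncertain, can be sketched as follows. The summary does not say how uncertainty is measured, so this example assumes Shannon entropy of the coarse per-voxel class probabilities and a fixed refinement budget; `entropy`, `select_uncertain_voxels`, `budget`, and the toy probabilities are all hypothetical names and values, not the paper's implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one voxel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain_voxels(voxel_probs, budget):
    """Return the indices of the `budget` voxels with the highest entropy.

    Assumption: entropy stands in for the uncertainty measure, which the
    summary does not specify.
    """
    ranked = sorted(
        range(len(voxel_probs)),
        key=lambda i: entropy(voxel_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Coarse predictions for four voxels over three classes (toy values).
coarse = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.50, 0.45, 0.05],  # ambiguous between two classes
    [0.90, 0.05, 0.05],  # fairly confident
]
to_refine = select_uncertain_voxels(coarse, budget=2)
```

Only the voxels in `to_refine` would be passed to a finer-grained refinement stage, so the cost of refinement scales with the budget rather than with the full voxel grid.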