Abstract: 3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation for processing 2D image features. However, depth estimation is an ill-posed problem, hindering the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth-estimation-free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance their performance. Experiments conducted on nuScenes-Occupancy and nuScenes-Occ3D demonstrate our framework's superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.

What problem does this paper attempt to address?

The paper attempts to address the following key issues:

1. **Challenges of Depth Estimation**: Traditional 3D occupancy prediction methods based on multi-sensor fusion rely on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, and its robustness and accuracy are difficult to guarantee, which affects the overall performance of these methods.
2. **Demand for Computational Resources**: Fine-grained 3D occupancy prediction requires a large amount of computational resources, which is a significant bottleneck in practical applications.
3. **Effective Fusion of Multimodal Data**: How to effectively fuse 2D image features with 3D LiDAR features without relying on depth estimation is an urgent problem to be solved.

To address the above challenges, the paper proposes the OccFusion framework, a multimodal fusion framework that does not require depth estimation. Specifically, the main contributions of this framework include:

- **Point-to-Point Multimodal Feature Fusion**: By preprocessing the LiDAR point cloud to generate a denser and more evenly distributed point cloud, and directly fusing 2D image features with 3D LiDAR features point-to-point, the instability of depth estimation is avoided.
- **Efficient Point Cloud Preprocessing Algorithm**: By generating synthetic point clouds and using farthest point sampling, the density and distribution of point clouds within each voxel are more uniform, thereby improving the effectiveness of feature fusion.
- **Active Decoder and Training Method**: An active decoder is introduced, which can selectively refine the prediction of voxels with high uncertainty, significantly reducing the complexity of the model. Additionally, an active training method is proposed, allowing the model to prioritize learning from more challenging samples, further enhancing the model's performance.

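
The farthest point sampling step mentioned in the preprocessing contribution can be illustrated with a minimal sketch. This is the generic textbook form of the technique, not the paper's implementation: the function name `farthest_point_sampling`, the `start` parameter, and the toy `voxel_points` data are all assumptions made for illustration, and the synthetic point generation that the paper pairs with it is not shown.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily so that each new
    point is the one farthest from all points selected so far.

    Generic illustration only; names and defaults are hypothetical.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]

    while len(selected) < k:
        # The next pick is the point that is currently worst covered.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update coverage distances with the newly selected point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy points standing in for the contents of one voxel (made-up values).
voxel_points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (1.0, 1.0, 1.0),
]
picked = farthest_point_sampling(voxel_points, 3)
```

In this toy example the near-duplicate point at index 1 is skipped in favour of points that spread out over the voxel, which is the property the summary attributes to the preprocessing step.
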
With these contributions, experiments on the nuScenes-Occupancy and nuScenes-Occ3D datasets show that the OccFusion framework outperforms existing multimodal baseline methods, with particularly strong results on small-object categories.
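
The idea behind the active decoder, refining only voxels whose coarse prediction is highly uncertain, can be sketched as follows. The summary does not say how uncertainty is measured, so this sketch assumes Shannon entropy of the per-voxel class probabilities. The names `entropy` and `select_uncertain_voxels`, the `ratio` parameter, and the `coarse` values are hypothetical and not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one voxel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain_voxels(voxel_probs, ratio=0.25):
    """Return sorted indices of the most uncertain voxels.

    Assumption: uncertainty is measured by entropy and a fixed
    fraction `ratio` of voxels is passed on for refinement.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(ratio * len(voxel_probs)))
    ranked = sorted(
        range(len(voxel_probs)),
        key=lambda i: entropy(voxel_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:n_keep])


# Made-up coarse predictions for four voxels over three classes.
coarse = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.70, 0.20, 0.10],  # moderately uncertain
    [0.05, 0.90, 0.05],  # fairly confident
]
to_refine = select_uncertain_voxels(coarse, ratio=0.5)
```

Only the selected indices would go through a finer, more expensive decoding stage, while the remaining voxels keep their coarse prediction. This is one plausible reading of how selective refinement reduces model complexity.
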

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction

OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

AFOcc: Multi-Modal Semantic Occupancy Prediction with Accurate Fusion

DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

PMAFusion: Projection-Based Multi-Modal Alignment for 3D Semantic Occupancy Prediction

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

AdaOcc: Adaptive-Resolution Occupancy Prediction

OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

OPUS: Occupancy Prediction Using a Sparse Set