Abstract: Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, existing approaches depend on large image resolutions and complex networks to achieve top performance, hindering their application in practical scenarios. Additionally, most multi-sensor fusion approaches focus on improving fusion features while overlooking the exploration of supervision strategies for these features. To this end, we propose DAOcc, a novel multi-sensor fusion occupancy network that leverages 3D object detection supervision to assist in achieving superior performance, while using a deployment-friendly image feature extraction network and practical input image resolution. Furthermore, we introduce a BEV View Range Extension strategy to mitigate the adverse effects of reduced image resolution. As a result, our approach achieves new state-of-the-art results on the Occ3D-nuScenes and SurroundOcc datasets, using ResNet50 and a 256x704 input image resolution. Code will be made available at https://github.com/AlphaPlusTT/DAOcc.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses several key problems in multi-sensor fusion for 3D semantic occupancy prediction:

1. **Dependence on high-resolution images and complex networks**
   - Existing methods usually rely on high-resolution images (such as 900×1600) and complex image feature extraction networks (such as ResNet101) to achieve their best performance. The resulting compute cost makes these models difficult to deploy on edge devices.
2. **Insufficient supervision strategies**
   - Most multi-sensor fusion methods focus on improving the fused features and overlook how those features are supervised. For example, Co-Occ supervises only with distance ground truth derived from the point cloud and does not fully use the geometric and structural information the point cloud contains.
3. **Effective utilization of point cloud data**
   - Point cloud data is sparser than images but contains rich geometric and structural information. Existing methods have not fully exploited these advantages, especially in multi-modal 3D semantic occupancy prediction.
4. **Impact of reducing image resolution**
   - Reducing image resolution causes information loss and lowers prediction accuracy. Maintaining or improving performance at a lower resolution is therefore a challenge.

### Proposed solutions

To solve these problems, the authors propose DAOcc (Detection-Assisted Occupancy), a new multi-sensor fusion occupancy network. Specific contributions include:

1. **Simple and efficient multi-modal baseline network**
   - The authors design a simple and efficient baseline for multi-modal 3D semantic occupancy prediction that needs neither complex deformable attention modules nor image depth estimation.
2. **3D object detection-assisted supervision**
   - Auxiliary 3D object detection supervision enhances the discriminative ability of the fused features, making them more sensitive to object boundaries and better able to capture the relationships between internal structures.
3. **BEV View Range Extension (BVRE) strategy**
   - BVRE expands the point cloud processing range, which provides a larger BEV field of view, adds contextual information, and alleviates the adverse effects of low-resolution images.
4. **New state-of-the-art performance**
   - DAOcc establishes new state-of-the-art performance on the Occ3D-nuScenes and SurroundOcc datasets while using ResNet50 and an input image resolution of 256×704.

Through these improvements, DAOcc improves the accuracy and robustness of 3D semantic occupancy prediction and is better suited to deployment in practical applications.
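The BEV View Range Extension idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the 0.4 m voxel size and the [-40 m, 40 m] target range follow the Occ3D-nuScenes convention, and the extended range of [-54 m, 54 m] is an illustrative assumption. The final center-crop step is also an assumption about how an extended BEV grid might be aligned with the annotated occupancy range; the summary itself does not specify it.

```python
def bev_grid_shape(x_range, y_range, voxel_size):
    """Number of BEV cells along x and y for a metric range (in meters)."""
    nx = round((x_range[1] - x_range[0]) / voxel_size)
    ny = round((y_range[1] - y_range[0]) / voxel_size)
    return nx, ny


def filter_points(points, x_range, y_range):
    """Keep (x, y, z) points whose x and y lie inside the half-open range."""
    return [
        p for p in points
        if x_range[0] <= p[0] < x_range[1] and y_range[0] <= p[1] < y_range[1]
    ]


def center_crop_bounds(extended_range, target_range, voxel_size):
    """Start/stop cell indices of the target range inside the extended grid (one axis)."""
    start = round((target_range[0] - extended_range[0]) / voxel_size)
    stop = round((target_range[1] - extended_range[0]) / voxel_size)
    return start, stop


# Assumed values for illustration only.
VOXEL = 0.4
TARGET = (-40.0, 40.0)    # annotated occupancy range
EXTENDED = (-54.0, 54.0)  # hypothetical extended point cloud range

points = [(10.0, 5.0, 0.0), (45.0, 0.0, 1.0), (-50.0, -53.9, 0.0), (60.0, 0.0, 0.0)]

kept_target = filter_points(points, TARGET, TARGET)
kept_extended = filter_points(points, EXTENDED, EXTENDED)

print(bev_grid_shape(TARGET, TARGET, VOXEL))
print(bev_grid_shape(EXTENDED, EXTENDED, VOXEL))
print(len(kept_target), len(kept_extended))
print(center_crop_bounds(EXTENDED, TARGET, VOXEL))
```

With these assumed numbers, the extended range keeps points that the original range would discard, so the BEV features near the border of the annotated region see more surrounding context. The cost is a larger BEV grid to process.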

DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction
