Abstract: Occupancy prediction plays a pivotal role in autonomous driving (AD) due to its fine-grained geometric perception and general object recognition capabilities. However, existing methods often incur high computational costs, which conflicts with the real-time demands of AD. To this end, we first evaluate the speed and memory usage of most publicly available methods, aiming to shift the focus from accuracy alone to efficiency as well. We then identify a core challenge in achieving both fast and accurate performance: \textbf{the strong coupling between geometry and semantics}. To address this issue, 1) we propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation. In the BEV branch, a BEV-level temporal fusion module and a U-Net encoder are introduced to extract dense semantic features. In the voxel branch, a large-kernel re-parameterized 3D convolution is proposed to refine sparse 3D geometry and reduce computation. Moreover, we propose a novel BEV-Voxel lifting module that projects BEV features into voxel space to fuse the features of the two branches. In addition to the network design, 2) we propose a Geometric-Semantic Decoupled Learning (GSDL) strategy. This strategy initially learns semantics with accurate geometry using ground-truth depth, and then gradually mixes in predicted depth to adapt the model to the predicted geometry. Extensive experiments on the widely used Occ3D-nuScenes benchmark demonstrate the superiority of our method, which achieves 39.4 mIoU at 20.0 FPS. This result is $\sim 3 \times$ faster and 1.9 mIoU higher than FB-OCC, the winner of the CVPR 2023 3D Occupancy Prediction Challenge. Our code will be made open-source.
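
The GSDL idea of starting from ground-truth depth and gradually mixing in predicted depth can be illustrated with a minimal sketch. The abstract does not specify the schedule or the mixing rule, so everything here is an assumption for illustration: the linear ramp, the `warmup_frac` parameter, the per-element convex blend, and the function names `mix_ratio` and `mix_depth` are all hypothetical and not taken from the authors' code.

```python
def mix_ratio(step, total_steps, warmup_frac=0.2):
    """Share of predicted depth used at a training step.

    Hypothetical schedule: pure ground-truth depth during warm-up,
    then a linear ramp up to pure predicted depth.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 0.0
    if total_steps == warmup:
        return 1.0
    ratio = (step - warmup) / (total_steps - warmup)
    # Clamp so steps beyond total_steps still give a valid ratio.
    return min(1.0, max(0.0, ratio))


def mix_depth(gt_depth, pred_depth, ratio):
    """Blend ground-truth and predicted depth values element-wise.

    Assumed mixing rule: a convex combination weighted by `ratio`.
    """
    if len(gt_depth) != len(pred_depth):
        raise ValueError("depth inputs must have the same length")
    return [(1.0 - ratio) * g + ratio * p for g, p in zip(gt_depth, pred_depth)]


if __name__ == "__main__":
    total = 100
    for step in (0, 20, 60, 100):
        r = mix_ratio(step, total)
        print(step, r, mix_depth([10.0, 20.0], [12.0, 16.0], r))
```

Early in training the semantic branch sees accurate geometry only; as the ratio grows, the model is exposed to its own depth errors, which is the adaptation the abstract describes. Other mixing rules, such as randomly swapping whole samples, would fit the same description equally well.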

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

UniVision: A Unified Framework for Vision-Centric 3D Perception

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers