Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Yulin He, Wei Chen, Tianci Xun, Yusong Tan
2024-07-21
Abstract: Occupancy prediction plays a pivotal role in autonomous driving (AD) due to its fine-grained geometric perception and general object recognition capabilities. However, existing methods often incur high computational costs, which conflicts with the real-time demands of AD. To this end, we first evaluate the speed and memory usage of most publicly available methods, aiming to redirect the focus from solely prioritizing accuracy to also considering efficiency. We then identify a core challenge in achieving both fast and accurate performance: \textbf{the strong coupling between geometry and semantics}. To address this issue, 1) we propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation. In the BEV branch, a BEV-level temporal fusion module and a U-Net encoder are introduced to extract dense semantic features. In the voxel branch, a large-kernel re-parameterized 3D convolution is proposed to refine sparse 3D geometry and reduce computation. Moreover, we propose a novel BEV-Voxel lifting module that projects BEV features into voxel space to fuse the features of the two branches. In addition to the network design, 2) we also propose a Geometric-Semantic Decoupled Learning (GSDL) strategy. This strategy initially learns semantics with accurate geometry using ground-truth depth, and then gradually mixes in predicted depth to adapt the model to the predicted geometry. Extensive experiments on the widely used Occ3D-nuScenes benchmark demonstrate the superiority of our method, which achieves 39.4 mIoU at 20.0 FPS. This result is $\sim 3 \times$ faster and 1.9 mIoU higher than FB-OCC, the winner of the CVPR 2023 3D Occupancy Prediction Challenge. Our code will be made open-source.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to achieve real-time 3D occupancy prediction for autonomous driving (AD) while maintaining high accuracy. Existing methods are accurate but computationally expensive, so they cannot meet the real-time requirements of AD. Specifically, the paper points out two issues:

1. **High computational cost**: Existing methods for 3D occupancy prediction usually process a large number of 3D voxel features, which leads to very high computational cost and memory usage. For example, on an NVIDIA A100 GPU, some methods run at only 1-3 FPS and use more than 10,000 MB of memory.
2. **Strong coupling between geometry and semantics**: In 3D occupancy prediction, geometric structure (such as depth prediction) and semantic information (such as object classification) are strongly coupled. This coupling makes the model hard to optimize, especially early in training, when inaccurate depth predictions seriously degrade the subsequent semantic classification.

To address these problems, the paper proposes the following solutions:

1. **Geometric-Semantic Dual-Branch Network (GSDBN)**: A two-branch network processes sparse geometric information and dense semantic information separately. The BEV branch extracts dense semantic features, while the voxel branch refines the sparse geometric structure with re-parameterized large-kernel 3D convolutions, improving computational efficiency while preserving geometric integrity.
2. **Geometric-Semantic Decoupled Learning (GSDL)**: A new learning strategy uses ground-truth depth in the early stage of training to decouple geometric learning from semantic learning. As training progresses, predicted depth is gradually mixed in so that the model adapts to the predicted geometry it will see at inference time, where no ground-truth depth is available.
3. **BEV-Voxel Lifting Module**: This module projects BEV-level semantic features into voxel space, fusing the features of the two branches and restoring the height information lost in the BEV representation.

With these designs, the proposed method, GSD-Occ, achieves 39.4 mIoU at an inference speed of 20.0 FPS on the Occ3D-nuScenes benchmark, which is about 3 times faster and 1.9 mIoU higher than FB-OCC.
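The GSDL idea of starting from ground-truth depth and gradually mixing in predicted depth can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gt_depth_weight` and `mix_depth`, the warm-up fraction, and the linear decay are all assumptions chosen for illustration, since the summary does not state the actual schedule. Depth is represented here as a plain list of per-bin probabilities for one pixel.

```python
def gt_depth_weight(step, total_steps, warmup_frac=0.2):
    """Weight placed on ground-truth depth at a given training step.

    Assumed schedule (hypothetical, not from the paper): the weight stays
    at 1.0 during a warm-up phase, then decays linearly to 0.0 by the end
    of training.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 1.0
    remaining = max(total_steps - warmup, 1)
    return max(0.0, 1.0 - (step - warmup) / remaining)


def mix_depth(gt_depth, pred_depth, weight):
    """Convex combination of ground-truth and predicted depth values.

    With weight in [0, 1], mixing two probability distributions over depth
    bins yields another valid distribution.
    """
    return [weight * g + (1.0 - weight) * p for g, p in zip(gt_depth, pred_depth)]


if __name__ == "__main__":
    gt = [0.0, 1.0, 0.0]    # one-hot ground-truth depth bin
    pred = [0.3, 0.4, 0.3]  # noisy predicted distribution
    for step in (0, 20, 60, 100):
        w = gt_depth_weight(step, total_steps=100)
        print(step, w, mix_depth(gt, pred, w))
```

Early in training the semantic branch sees exact geometry (weight 1.0), so classification errors are not caused by depth errors. By the end the weight reaches 0.0 and the model trains on the same predicted depth it will receive at inference time.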