Abstract: In autonomous driving, 3D occupancy prediction outputs voxel-wise status and semantic labels for a more comprehensive understanding of 3D scenes compared with traditional perception tasks, such as 3D object detection and bird's-eye view (BEV) semantic segmentation. Recent research has extensively explored various aspects of this task, including view transformation techniques, ground-truth label generation, and elaborate network design, aiming to achieve superior performance. However, the inference speed, crucial for running on an autonomous vehicle, is neglected. To this end, a new method, dubbed FastOcc, is proposed. By carefully analyzing the network effect and latency from four parts, including the input image resolution, image backbone, view transformation, and occupancy prediction head, it is found that the occupancy prediction head holds considerable potential for accelerating the model while keeping its accuracy. Targeted at improving this component, the time-consuming 3D convolution network is replaced with a novel residual-like architecture, where features are mainly digested by a lightweight 2D BEV convolution network and compensated by integrating the 3D voxel features interpolated from the original image features. Experiments on the Occ3D-nuScenes benchmark demonstrate that our FastOcc achieves state-of-the-art results with a fast inference speed.

What problem does this paper attempt to address?

This paper addresses 3D occupancy prediction in autonomous driving, a key task because it provides a more comprehensive understanding of the 3D scene than traditional perception tasks such as 3D object detection and bird's-eye-view (BEV) semantic segmentation. Existing research has explored view transformation techniques, ground-truth label generation, and network design, but inference speed, which is crucial for running on a vehicle, is often overlooked. The paper proposes a new method, FastOcc, that aims to accelerate 3D occupancy prediction while maintaining high accuracy.

The authors analyze the effect on accuracy and latency of four parts of the network: input image resolution, image backbone, view transformation, and occupancy prediction head. They find that the occupancy prediction head holds considerable potential for speeding up the model while keeping its accuracy. FastOcc therefore replaces the time-consuming 3D convolutional network with a residual-like architecture: features are compressed into a BEV representation and decoded in 2D by a lightweight BEV convolutional network, and the result is then refined with 3D voxel features interpolated from the original image features.

Experimental results show that FastOcc achieves state-of-the-art results on the Occ3D-nuScenes benchmark with fast inference. The latency of a single inference is reduced to 63 milliseconds, and further to 32 milliseconds with TensorRT SDK acceleration. The paper also compares the performance and runtime of different methods across visual perception tasks, including 3D object detection and 3D occupancy prediction, and describes the training loss function.

In summary, FastOcc is an efficient, real-time method for 3D occupancy prediction that improves scene understanding and real-time perception for autonomous driving.
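
The residual-like idea summarized above (decode in 2D BEV, then compensate with voxel features interpolated from image features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `bilinear_sample` and `fuse_bev_and_voxel`, the tensor layouts, the use of element-wise addition for fusion, and the assumption that each voxel center has already been projected to pixel coordinates of a single image feature map are all hypothetical choices made for illustration.

```python
import numpy as np

def bilinear_sample(feat, u, v):
    """Bilinearly sample feat (C, H, W) at float pixel coords u (x) and v (y).

    u and v are 1D arrays of length N; the result has shape (C, N).
    Coordinates outside the image are clamped to the border (an assumption).
    """
    _, H, W = feat.shape
    u = np.clip(u, 0, W - 1)
    v = np.clip(v, 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = u - u0
    dv = v - v0
    top = feat[:, v0, u0] * (1 - du) + feat[:, v0, u1] * du
    bottom = feat[:, v1, u0] * (1 - du) + feat[:, v1, u1] * du
    return top * (1 - dv) + bottom * dv

def fuse_bev_and_voxel(bev_feat, img_feat, voxel_uv):
    """Combine 2D BEV features with image-interpolated voxel features.

    bev_feat: (C, X, Y) output of a 2D BEV decoder (hypothetical layout).
    img_feat: (C, H, W) image feature map from the backbone.
    voxel_uv: (X, Y, Z, 2) pixel coordinates (u, v) of each voxel center,
              assumed to be precomputed from the camera calibration.
    Returns voxel features of shape (C, X, Y, Z).
    """
    C, X, Y = bev_feat.shape
    Z = voxel_uv.shape[2]
    uv = voxel_uv.reshape(-1, 2)
    sampled = bilinear_sample(img_feat, uv[:, 0], uv[:, 1]).reshape(C, X, Y, Z)
    # Broadcast the BEV features along the height axis and add the
    # interpolated voxel features as a residual-style correction.
    return bev_feat[..., None] + sampled

# Toy usage with random features and random projected coordinates.
rng = np.random.default_rng(0)
bev = rng.normal(size=(8, 4, 4))
img = rng.normal(size=(8, 16, 32))
uv = rng.uniform(0, 15, size=(4, 4, 3, 2))
voxels = fuse_bev_and_voxel(bev, img, uv)
print(voxels.shape)  # (8, 4, 4, 3)
```

In this sketch the 3D work is limited to a sampling step and an addition, while all learned processing would happen in 2D. A real pipeline would also need to handle multiple cameras and voxels that fall outside every camera's view, and would follow the fusion with a per-voxel classifier.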

FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View

Fast Occupancy Network

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction

Fully Sparse 3D Occupancy Prediction

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction

MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

OPUS: Occupancy Prediction Using a Sparse Set

EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement

Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

AdaOcc: Adaptive-Resolution Occupancy Prediction

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder