Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, Hang Zhao
2023-12-14
Abstract: Robotic perception requires the modeling of both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, neglecting finer geometric details and struggling to handle general, out-of-vocabulary objects. 3D occupancy prediction, which estimates the detailed occupancy states and semantics of a scene, is an emerging task to overcome these limitations. To support 3D occupancy prediction, we develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. This pipeline comprises three stages: voxel densification, occlusion reasoning, and image-guided voxel refinement. We establish two benchmarks, derived from the Waymo Open Dataset and the nuScenes Dataset, namely Occ3D-Waymo and Occ3D-nuScenes benchmarks. Furthermore, we provide an extensive analysis of the proposed dataset with various baseline models. Lastly, we propose a new model, dubbed Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the Occ3D benchmarks. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
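The first two stages named in the abstract, voxel densification and occlusion reasoning, can be illustrated with a toy sketch. This is not the Occ3D pipeline itself: it assumes the points from all frames are already expressed in one common coordinate frame, uses a uniform grid, and approximates occlusion reasoning with fixed-step ray marching from a single sensor origin. The names `voxelize`, `label_visibility`, and the state codes are hypothetical, and the third stage (image-guided voxel refinement) is not covered.

```python
import math

# Hypothetical state codes for this illustration only.
FREE, OCCUPIED, UNOBSERVED = 0, 1, 2


def voxel_index(point, voxel_size):
    """Map a 3D point (x, y, z) to integer voxel coordinates."""
    return tuple(math.floor(c / voxel_size) for c in point)


def voxelize(frames, voxel_size):
    """Aggregate points from several frames into one set of occupied voxels.

    `frames` is a list of point lists, assumed to share a common frame.
    """
    occupied = set()
    for points in frames:
        for p in points:
            occupied.add(voxel_index(p, voxel_size))
    return occupied


def label_visibility(occupied, origin, grid_shape, voxel_size, step=0.25):
    """Label every voxel FREE, OCCUPIED, or UNOBSERVED as seen from `origin`.

    A ray is marched from the origin towards the centre of each voxel.
    Voxels crossed before the first occupied voxel become FREE, the first
    occupied voxel hit becomes OCCUPIED, and anything behind it keeps its
    initial UNOBSERVED label. Fixed-step marching is an approximation and
    can miss voxels that a ray only clips at a corner.
    """
    nx, ny, nz = grid_shape
    labels = {
        (i, j, k): UNOBSERVED
        for i in range(nx) for j in range(ny) for k in range(nz)
    }
    for target in list(labels):
        centre = [(c + 0.5) * voxel_size for c in target]
        direction = [c - o for c, o in zip(centre, origin)]
        length = math.sqrt(sum(d * d for d in direction))
        n_steps = max(1, int(length / (step * voxel_size)))
        for s in range(n_steps + 1):
            t = s / n_steps
            point = [o + t * d for o, d in zip(origin, direction)]
            idx = voxel_index(point, voxel_size)
            if idx not in labels:
                continue  # sample fell outside the grid
            if idx in occupied:
                labels[idx] = OCCUPIED
                break  # the ray is blocked; voxels behind stay unobserved
            labels[idx] = FREE
    return labels


# Two sparse "frames" that both hit the same voxel in a 5 x 1 x 1 grid.
frames = [[(2.3, 0.5, 0.5)], [(2.7, 0.2, 0.9)]]
occupied = voxelize(frames, voxel_size=1.0)
labels = label_visibility(occupied, (0.5, 0.5, 0.5), (5, 1, 1), 1.0)
```

In this toy grid the voxels between the sensor and the obstacle are marked free, the obstacle voxel is marked occupied, and the voxels behind it remain unobserved, which is the sense in which such labels are visibility-aware.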
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limitations of existing 3D perception methods in robotic vision systems such as autonomous driving. Current 3D object detection methods usually estimate 3D bounding boxes, which discards finer geometric detail and makes it hard to handle general objects (GOs) that fall outside the predefined vocabulary. To overcome these limitations, the paper introduces the 3D occupancy prediction task, which estimates the occupancy state and semantic label of each voxel in the scene.

### Specific manifestations of the problem:

1. **Loss of geometric details**: 3D bounding box representations ignore the geometric details of objects, such as the part of a construction vehicle's mechanical arm that extends beyond the main body.
2. **Neglect of unlisted objects**: Many real-world objects do not belong to any predefined category (for example, trash cans on the street), and these objects are usually ignored or left unlabeled in the dataset.
3. **Poor adaptability to dynamic scenes**: Existing 3D perception methods have difficulty with complex situations in dynamic scenes.

### Solutions:

To solve the above problems, the paper proposes the following:

- **Occ3D dataset**: A large-scale 3D occupancy prediction benchmark dataset with rich semantic and geometric information. Its labels are generated through multi-frame aggregation, occlusion reasoning, and image-guided voxel refinement.
- **Automatic annotation generation pipeline**: A rigorous automatic label generation pipeline that addresses sparsity, occlusion, and 3D-2D alignment.
- **Coarse-to-Fine Occupancy (CTF-Occ) network**: A Transformer-based coarse-to-fine 3D occupancy prediction network. It uses cross-attention to aggregate 2D image features into 3D space, which yields more accurate 3D occupancy prediction.

### Main contributions:

1. **Introduction of the Occ3D dataset**: A high-quality 3D occupancy prediction benchmark that supports research in this emerging field.
2. **Proposal of an automatic annotation generation pipeline**: A rigorous automatic label generation pipeline whose effectiveness is comprehensively verified.
3. **Proposal of the CTF-Occ network**: A new model that achieves superior 3D occupancy prediction performance on the Occ3D benchmarks.

Through these contributions, the paper aims to advance 3D perception, especially in applications such as autonomous driving that require high precision and robustness.
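The idea of aggregating 2D image features into 3D space through cross-attention can be sketched in a few lines. The following is generic single-head scaled dot-product attention in plain Python, not the CTF-Occ architecture: it omits learned projections, multiple heads, and the coarse-to-fine refinement, and the function names and toy feature vectors are assumptions made for illustration. Each 3D voxel supplies a query vector, and each 2D image location supplies a key and a value vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one feature vector per 3D voxel.
    keys, values: one feature vector each per 2D image location.
    Returns one aggregated feature vector per voxel query.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this voxel query and every image feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of image values: the feature lifted into 3D.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(out_dim)
        ])
    return outputs


# Toy example: two image features, two voxel queries.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [3.0, 2.0]]
voxel_queries = [[0.0, 0.0], [10.0, 0.0]]
voxel_features = cross_attention(voxel_queries, keys, values)
```

A zero query attends to both image features equally and receives their average, while the second query aligns with the first key and receives a feature close to the first value. In a real model the queries, keys, and values would be produced by learned layers.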