Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang
Abstract: Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in the near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provide sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, a standardized evaluation protocol for multiple preset tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper addresses 4D occupancy prediction from camera images alone in autonomous driving applications. Existing camera-only occupancy estimation techniques provide dense occupancy representations of large-scale scenes from the current observation, but most of them are limited to representing the current 3D space and do not consider the future states of surrounding objects along the time axis. To extend camera-only occupancy estimation to spatio-temporal prediction, the paper proposes a new benchmark named Cam4DOcc that evaluates how the surrounding scene changes in the near future.
### Main contributions
1. **Propose the Cam4DOcc benchmark**: This is the first benchmark to promote future camera-based 4D occupancy prediction work.
2. **Propose a new dataset format**: By leveraging existing datasets (nuScenes, nuScenes-Occupancy, and Lyft-Level5), a new dataset format is constructed that is suitable for prediction tasks in autonomous driving scenarios.
3. **Provide four baseline methods**:
- Static world occupancy model
- Voxelization of point cloud prediction
- 2D-3D instance-based prediction
- End-to-end 4D occupancy prediction network
4. **Introduce a standardized evaluation protocol**: A set of standardized evaluation protocols is proposed, and comprehensive experiments and detailed analyses are carried out to evaluate the performance of various baseline methods in current and future occupancy estimation.
### Task definition
Given the consecutive camera images \(I = \{I_t\}_{t = -N_p}^{0}\) covering the past \(N_p\) frames and the current frame, the goal of 4D occupancy prediction is to output the current occupancy \(O_c\in\mathbb{R}^{1\times H\times W\times L}\) and the future occupancy \(O_f\in\mathbb{R}^{N_f\times H\times W\times L}\). Here, \(H\), \(W\), and \(L\) represent the height, width, and length of a specific range defined in the current coordinate system, respectively. Each voxel of \(O_f\) has \(N_f\) consecutive states \(S = \{S_t\}_{t = 1}^{N_f}\), where \(S_t\) indicates whether the voxel is free or occupied at future time stamp \(t\).
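The input and output shapes in this definition can be made concrete with a small sketch. It is illustrative only: the function names, the number of cameras, the image resolution, and the grid size are assumptions chosen for the example and are not taken from the benchmark, and the occupancy values are random placeholders rather than real labels.

```python
import numpy as np

def make_dummy_sample(n_past=2, n_future=4, n_cams=6,
                      img_hw=(90, 160), grid_hwl=(8, 64, 64), seed=0):
    """Build random arrays with the shapes used in the task definition.

    All sizes are hypothetical defaults, not benchmark settings.
    """
    rng = np.random.default_rng(seed)
    # Input I: N_p past frames plus the current frame (t = -N_p, ..., 0).
    images = rng.random((n_past + 1, n_cams, 3, *img_hw), dtype=np.float32)
    # Current occupancy O_c with shape (1, H, W, L); True means occupied.
    o_c = rng.random((1, *grid_hwl)) > 0.9
    # Future occupancy O_f with shape (N_f, H, W, L).
    o_f = rng.random((n_future, *grid_hwl)) > 0.9
    return images, o_c, o_f

def voxel_states(o_f, h, w, l):
    """Return the N_f future states S_1, ..., S_{N_f} of a single voxel."""
    return o_f[:, h, w, l]

images, o_c, o_f = make_dummy_sample()
states = voxel_states(o_f, 0, 10, 10)
```

With these assumed defaults, `images` holds three frames (two past, one current), and `states` is a length-4 boolean vector giving the free/occupied state of one voxel at each future time stamp.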
### Dataset format
The Cam4DOcc benchmark introduces a new dataset format based on the original nuScenes, nuScenes-Occupancy, and Lyft-Level5 datasets. The dataset is constructed as follows:
1. **Split the original data**: Split the original nuScenes dataset into sequences with a time length of \(N = N_p + N_f + 1\).
2. **Extract and transform annotations**: Extract the continuous semantic and instance annotations of movable objects in each sequence and transform them to the current coordinate system (\(t = 0\)).
3. **Voxelize the 3D space**: Voxelize the current 3D space and attach semantic/instance labels to the voxel grids of movable objects using bounding box annotations.
4. **Filter invalid instances**: Discard invalid instances with a visibility lower than 40%, those that first appear in future frames, or those that are outside the predefined range.
5. **Generate 3D backward centripetal flow**: Use instance association to generate 3D backward centripetal flow, pointing from the voxel at time \(t\) to the corresponding 3D instance center at time \(t - 1\).
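Steps 1 and 5 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the function names are hypothetical, windows are taken with a stride of one frame, instance id 0 is treated as background, and an instance center is approximated by the mean index of its voxels (the benchmark may derive centers differently, for example from bounding boxes). Flow is expressed in voxel units.

```python
import numpy as np

def split_sequences(n_frames, n_past, n_future):
    """Step 1: index windows of length N = N_p + N_f + 1 (stride 1 assumed)."""
    n = n_past + n_future + 1
    return [list(range(s, s + n)) for s in range(n_frames - n + 1)]

def instance_centers(inst_grid):
    """Map each instance id to the mean voxel index of its voxels.

    Id 0 is assumed to mean background and is skipped.
    """
    centers = {}
    for inst_id in np.unique(inst_grid):
        if inst_id == 0:
            continue
        centers[int(inst_id)] = np.argwhere(inst_grid == inst_id).mean(axis=0)
    return centers

def backward_centripetal_flow(inst_prev, inst_curr):
    """Step 5: per-voxel vector from a voxel at time t to the center of
    the same instance at time t - 1. Voxels without a match keep zero flow.
    """
    flow = np.zeros(inst_curr.shape + (3,), dtype=np.float32)
    for inst_id, center in instance_centers(inst_prev).items():
        idx = np.argwhere(inst_curr == inst_id)  # (k, 3) voxel indices at t
        if idx.size == 0:
            continue  # instance not present at time t
        flow[tuple(idx.T)] = center - idx
    return flow

# Toy example: one instance moves by one voxel along the first axis.
prev = np.zeros((4, 4, 4), dtype=np.int64)
curr = np.zeros((4, 4, 4), dtype=np.int64)
prev[1, 1, 1] = 1
curr[2, 1, 1] = 1
flow = backward_centripetal_flow(prev, curr)
windows = split_sequences(n_frames=10, n_past=2, n_future=4)
```

In the toy example the occupied voxel at time \(t\) receives the flow vector \((-1, 0, 0)\), which points back to where the instance center was at time \(t - 1\). Within each window returned by `split_sequences`, the current frame sits at position `n_past`.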
### Evaluation protocol
To comprehensively evaluate the performance of camera-only 4D occupancy prediction, the paper establishes multiple evaluation tasks and metrics with different complexity levels.
1. **Predict inflated GMO**: Divide all occupancy grids into general movable objects (GMO) and others, where the voxel grids within the instance bounding boxes from nuScenes and Lyft-Level5 are labeled as GMO.
2. **Predict fine-grained GMO**: Again divide the categories into GMO and others, but take the GMO annotations directly from the voxel-level labels of nuScenes-Occupancy, with invalid grids removed.
3. **Predict inflated GMO, fine-grained GSO and**