CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao
2024-09-25
Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at https://github.com/Tsinghua-MARS-Lab/CVT-Occ.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to address?

This paper addresses a key challenge in vision-based 3D occupancy prediction: the inherent limitations of monocular vision in depth estimation. Specifically:

1. **Depth estimation in monocular vision**:
   - Monocular vision cannot provide accurate depth information, which makes the estimated positions of objects in 3D space ambiguous and uncertain.
   - This ambiguity makes 3D occupancy prediction from monocular images alone very difficult.
2. **Limitations of existing methods**:
   - Stereo vision can improve depth estimation, but its need for calibration and recalibration limits its practical use in autonomous vehicles and robotic systems.
   - Existing temporal fusion methods (such as self-attention and warp-and-concat) use temporal information only implicitly and do not fully exploit geometric constraints, so their performance gains are limited.

### Proposed solutions

To address these problems, the paper introduces CVT-Occ (Cost Volume Temporal Fusion for 3D Occupancy Prediction), which improves 3D occupancy prediction in the following ways:

1. **Temporal fusion**:
   - CVT-Occ uses the temporal geometric correspondences of multi-frame historical data. For each voxel, it samples points along the line of sight and combines the historical features of these points to construct a cost volume feature map.
   - In this way, CVT-Occ explicitly uses parallax cues, which lets it infer depth more accurately and reduces the ambiguity of depth estimation.
2. **Efficient cost volume construction**:
   - Unlike the traditional approach of computing a cost volume for each pair of images, CVT-Occ avoids excessive computational overhead and is more efficient.
   - Learning the cost volume in a data-driven manner further improves the generalization ability and accuracy of the model.
3. **Experimental verification**:
   - The paper reports rigorous experiments on the Occ3D-Waymo dataset. The results show that CVT-Occ outperforms existing state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost.

In conclusion, CVT-Occ addresses the limitations of monocular depth estimation by introducing temporal geometric correspondences and an efficient cost volume construction method, and it improves the accuracy of 3D occupancy prediction.
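The line-of-sight sampling and historical-feature aggregation described above can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the released repository for that). It is a minimal illustration under stated assumptions: the function names, the `(X, Y, Z, C)` volume layout, the 4x4 ego-motion matrices, the nearest-neighbour lookup, and the fixed sampling step are all hypothetical choices made here for clarity.

```python
import numpy as np

def sample_along_ray(voxel_center, cam_origin, num_points=5, step=0.5):
    """Return num_points 3D points centred on the voxel, spaced `step` apart
    along the camera-to-voxel line of sight (assumed sampling scheme)."""
    direction = voxel_center - cam_origin
    direction = direction / np.linalg.norm(direction)
    offsets = (np.arange(num_points) - (num_points - 1) / 2.0) * step
    return voxel_center[None, :] + offsets[:, None] * direction[None, :]

def to_past_frame(points, T_past_from_cur):
    """Map points from current ego coordinates to a past ego frame
    using a 4x4 rigid transform (assumed to be given by ego-motion)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ T_past_from_cur.T)[:, :3]

def nearest_voxel_lookup(volume, points, origin, voxel_size):
    """Nearest-neighbour feature lookup in an (X, Y, Z, C) volume.
    Points that fall outside the grid receive zero features."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    shape = np.array(volume.shape[:3])
    valid = np.all((idx >= 0) & (idx < shape), axis=1)
    feats = np.zeros((len(points), volume.shape[3]), dtype=volume.dtype)
    feats[valid] = volume[idx[valid, 0], idx[valid, 1], idx[valid, 2]]
    return feats

def cost_volume_feature(voxel_center, cam_origin, past_volumes, past_transforms,
                        origin, voxel_size, num_points=5, step=0.5):
    """Concatenate the features of the sampled points over all historical
    frames into one vector of length K * num_points * C for this voxel."""
    pts = sample_along_ray(voxel_center, cam_origin, num_points, step)
    per_frame = [
        nearest_voxel_lookup(vol, to_past_frame(pts, T), origin, voxel_size)
        for vol, T in zip(past_volumes, past_transforms)
    ]
    return np.concatenate(per_frame, axis=0).reshape(-1)
```

In a learned pipeline, a vector like this would be fed to a small network that produces weights for refining the current volume features; that learned step is omitted here. The point of the sketch is only that a voxel's depth hypothesis is checked against several past viewpoints by looking up where points on its line of sight land in earlier frames.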