Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at https://github.com/Tsinghua-MARS-Lab/CVT-Occ.
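
The line-of-sight sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the released CVT-Occ code: the function names, the symmetric offsets around the voxel center, the `step` spacing, and the 4x4 `hist_from_cur` ego-motion matrix are all assumptions made for the example.

```python
import numpy as np

def sample_line_of_sight(voxel_centers, origin, num_points=3, step=1.0):
    """Sample points on the ray from `origin` through each voxel center.

    voxel_centers: (N, 3) voxel centers in the current ego frame.
    Returns (N, num_points, 3); the middle sample is the voxel center itself
    when num_points is odd. The offset scheme is an illustrative assumption.
    """
    voxel_centers = np.asarray(voxel_centers, dtype=float)
    rays = voxel_centers - np.asarray(origin, dtype=float)
    norms = np.linalg.norm(rays, axis=1, keepdims=True)
    dirs = rays / np.clip(norms, 1e-6, None)          # unit viewing directions
    offsets = step * (np.arange(num_points) - num_points // 2)
    return voxel_centers[:, None, :] + offsets[None, :, None] * dirs[:, None, :]

def to_historical_frame(points, hist_from_cur):
    """Map (N, K, 3) points from the current ego frame into a historical one.

    hist_from_cur: hypothetical 4x4 homogeneous transform (ego motion).
    """
    ones = np.ones(points.shape[:-1] + (1,))
    homo = np.concatenate([points, ones], axis=-1)    # (N, K, 4)
    return (homo @ hist_from_cur.T)[..., :3]
```

Because the samples lie at different depths along the same ray, ego motion moves them to different locations in a historical frame, which is where the parallax cue comes from.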

### What problem does this paper attempt to solve?

This paper addresses key challenges in vision-based 3D occupancy prediction, especially the inherent limitations of monocular vision in depth estimation. Specifically:

1. **Depth estimation in monocular vision**:
   - Monocular vision cannot provide accurate depth information, which makes the estimated positions of objects in 3D space ambiguous and uncertain.
   - This ambiguity makes 3D occupancy prediction from monocular images alone very difficult.
2. **Limitations of existing methods**:
   - Although stereo vision can enhance depth estimation, it requires calibration and recalibration in practice, which makes it hard to deploy widely in autonomous vehicles and robotic systems.
   - Existing temporal fusion methods (such as self-attention and Warp and Concat) use temporal information only implicitly and do not fully exploit geometric constraints, so their performance gains are limited.

### Proposed solutions

To solve these problems, the paper introduces a new method, CVT-Occ (Cost Volume Temporal Fusion for 3D Occupancy Prediction), which improves 3D occupancy prediction in the following ways:

1. **Temporal fusion**:
   - CVT-Occ uses the temporal geometric correspondences of multi-frame historical data. It samples points along the line of sight of each voxel and combines the historical features of these points to construct a cost volume feature map.
   - In this way, CVT-Occ explicitly exploits parallax cues, inferring depth more accurately and reducing the ambiguity of depth estimation.
2. **Efficient cost volume construction**:
   - Unlike the traditional approach of computing a cost volume for each pair of images, CVT-Occ avoids excessive computational overhead and is more efficient.
   - Learning the cost volume in a data-driven manner further improves the model's generalization and accuracy.
3. **Experimental verification**:
   - The paper reports rigorous experiments on the Occ3D-Waymo dataset. The results show that CVT-Occ outperforms existing state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost.

In conclusion, CVT-Occ addresses the limitations of monocular vision in depth estimation by introducing temporal geometric correspondences and an efficient cost volume construction method, and it improves the accuracy of 3D occupancy prediction.
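
To make the cost volume construction above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the nearest-voxel lookup, the concatenation layout, and the sigmoid gate with hypothetical parameters `W` and `b` are assumptions chosen for brevity, and the actual model may sample and refine features differently.

```python
import numpy as np

def gather_nearest(volume, points, grid_min, voxel_size):
    """Nearest-voxel feature lookup (an assumed, simplified sampler).

    volume: (X, Y, Z, C) features of one historical frame.
    points: (N, K, 3) sample points expressed in that frame.
    Returns (N, K, C); samples outside the grid contribute zeros.
    """
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    shape = np.array(volume.shape[:3])
    inside = np.all((idx >= 0) & (idx < shape), axis=-1)
    idx = np.clip(idx, 0, shape - 1)
    feats = volume[idx[..., 0], idx[..., 1], idx[..., 2]]
    return feats * inside[..., None]

def build_cost_volume(hist_volumes, hist_points, grid_min, voxel_size):
    """Concatenate sampled features over T frames and K samples: (N, T*K*C)."""
    per_frame = [
        gather_nearest(v, p, grid_min, voxel_size).reshape(p.shape[0], -1)
        for v, p in zip(hist_volumes, hist_points)
    ]
    return np.concatenate(per_frame, axis=1)

def refine(current_feats, cost_volume, W, b):
    """Rescale current features by a gate computed from the cost volume.

    W: (T*K*C, 1) and b: scalar are hypothetical learned parameters.
    """
    gate = 1.0 / (1.0 + np.exp(-(cost_volume @ W + b)))   # (N, 1) in (0, 1)
    return current_feats * gate
```

Note that the lookup runs once per historical frame over shared sample points instead of once per image pair, which reflects the efficiency argument made in the summary.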

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction

OVO: Open-Vocabulary Occupancy

FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View