Abstract: 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields (SDFs). We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc
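
The abstract mentions "SDF-induced weights" and rendering from the 3D representation without giving the formula. As a rough illustration only, the sketch below assumes a NeuS-style conversion, in which a logistic function of the signed distance yields per-interval opacities that are then alpha-composited along a ray. The names `sdf_to_weights`, `render_depth`, and `sharpness` are hypothetical and are not taken from the SelfOcc code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sdf_to_weights(sdf_values, sharpness=10.0):
    """Convert SDF samples along one ray into rendering weights.

    Assumption: NeuS-style opacity per interval,
    alpha_i = max((s(f_i) - s(f_{i+1})) / s(f_i), 0),
    where s is a logistic function with slope `sharpness`.
    Returns one weight per interval (len(sdf_values) - 1 entries).
    """
    cdf = [sigmoid(sharpness * f) for f in sdf_values]
    weights = []
    transmittance = 1.0  # fraction of light that has not hit a surface yet
    for i in range(len(cdf) - 1):
        alpha = max((cdf[i] - cdf[i + 1]) / max(cdf[i], 1e-8), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_depth(interval_depths, weights):
    """Expected depth along the ray: weighted mean of interval depths."""
    total = sum(weights)
    return sum(d * w for d, w in zip(interval_depths, weights)) / max(total, 1e-8)


# Toy ray facing a wall at depth 5.0, so the SDF along the ray is 5.0 - t.
sample_depths = [0.5 * i for i in range(1, 21)]  # 0.5, 1.0, ..., 10.0
sdf_along_ray = [5.0 - t for t in sample_depths]

weights = sdf_to_weights(sdf_along_ray)
# Use interval midpoints as the depth associated with each weight.
midpoints = [(a + b) / 2.0 for a, b in zip(sample_depths[:-1], sample_depths[1:])]
depth = render_depth(midpoints, weights)
```

In this toy setup the weights concentrate around the zero crossing of the SDF, so the rendered depth lands close to the wall at 5.0. In a self-supervised pipeline, a depth rendered this way could be used to warp neighboring video frames and compare them photometrically.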

What problem does this paper attempt to address?

The problem this paper addresses is how to learn and predict the 3D occupancy state of a scene without 3D-annotated data, using only video sequences as supervision. Specifically, the paper proposes a self-supervised method named SelfOcc, which learns 3D occupancy representations from video sequences and is applied to autonomous driving.

### Problem Background

3D occupancy prediction is an important task in autonomous driving. Its goal is to predict whether each point in the surrounding 3D space is occupied. Existing methods usually rely on 3D-annotated data (such as LiDAR point clouds or dense occupancy labels) to produce meaningful results. However, acquiring these 3D annotations is time-consuming and costly, and it is difficult to scale.

### Core Problems of the Paper

The main question posed in the paper is how to achieve 3D occupancy prediction through self-supervised learning using only video-sequence data. Specifically, the paper aims to address the following key challenges:

1. **No 3D-Annotated Data**: How to learn 3D occupancy representations relying only on the spatio-temporal consistency in video sequences, in the absence of 3D annotations.
2. **Multi-View Consistency**: How to use the multi-view information in video sequences to improve the accuracy of 3D occupancy prediction.
3. **Geometric Reasoning**: How to ensure that the predicted 3D occupancy has a reasonable geometric structure, especially in complex scenes.

### Solution Overview

To address these problems, the paper proposes the SelfOcc model, which contains the following main contributions:

1. **Conversion from Image to 3D Representation**: 2D image features are lifted into 3D space (such as a bird's-eye view (BEV) or tri-perspective view (TPV) representation), which enables 3D feature interaction and avoids ambiguities caused by multiple cameras.
2. **Introduction of Self-Supervised Signals**: The temporal consistency in video sequences is exploited by projecting the 3D occupancy prediction back to 2D views through differentiable volume rendering, which provides the self-supervision signal.
3. **MVS-Embedded Strategy**: A multi-view stereo (MVS) embedded strategy directly optimizes the weights induced by the signed distance field (SDF) using multiple depth proposals, improving the accuracy of depth prediction.
4. **Regularization and Constraints**: Regularizers such as a Hessian loss and an Eikonal term keep the SDF smooth and physically meaningful.

### Experimental Verification

The paper reports experiments on multiple public datasets, including SemanticKITTI and nuScenes, covering tasks such as 3D occupancy prediction and depth synthesis. According to the abstract, SelfOcc outperforms the previous best method SceneRF by 58.7% with single-frame input on SemanticKITTI and is the first self-supervised method to produce reasonable 3D occupancy for surround cameras on nuScenes.

### Summary

With SelfOcc, the paper addresses 3D occupancy prediction in the absence of 3D-annotated data, offering a new approach to 3D perception for autonomous driving.
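
The Eikonal term mentioned in the solution overview can be illustrated with a small sketch. A valid signed distance field has a gradient of unit norm everywhere, so a common regularizer penalizes the squared deviation of the gradient norm from 1. The version below estimates gradients with central finite differences on an analytic field; this is an assumption made for illustration (a trained model would typically use automatic differentiation), and `eikonal_loss`, `sphere_sdf`, and `scaled_field` are hypothetical names rather than parts of the SelfOcc code.

```python
import math


def eikonal_loss(field, points, eps=1e-4):
    """Mean of (||grad f|| - 1)^2 over the given 3D points.

    `field` is any callable f(x, y, z) -> float. Gradients are
    approximated with central finite differences of step `eps`.
    """
    total = 0.0
    for x, y, z in points:
        gx = (field(x + eps, y, z) - field(x - eps, y, z)) / (2.0 * eps)
        gy = (field(x, y + eps, z) - field(x, y - eps, z)) / (2.0 * eps)
        gz = (field(x, y, z + eps) - field(x, y, z - eps)) / (2.0 * eps)
        grad_norm = math.sqrt(gx * gx + gy * gy + gz * gz)
        total += (grad_norm - 1.0) ** 2
    return total / len(points)


def sphere_sdf(x, y, z, radius=1.0):
    """Exact signed distance to a sphere centered at the origin."""
    return math.sqrt(x * x + y * y + z * z) - radius


def scaled_field(x, y, z):
    """Same zero level set as the sphere, but not a true distance field."""
    return 2.0 * sphere_sdf(x, y, z)


points = [(0.5, 0.2, -0.3), (1.5, 0.0, 0.7), (-2.0, 1.0, 0.4)]
loss_valid = eikonal_loss(sphere_sdf, points)     # near zero
loss_invalid = eikonal_loss(scaled_field, points)  # gradient norm is 2, so near 1
```

Both fields describe the same surface, yet only the first one passes the Eikonal check. This is the sense in which such a regularizer keeps the learned field interpretable as a distance rather than an arbitrary implicit function.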

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction

MonoOcc: Digging into Monocular Semantic Occupancy Prediction

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Monocular Occupancy Prediction for Scalable Indoor Scenes

OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction

AdaptiveOcc: Adaptive Octree-based Network for Multi-Camera 3D Semantic Occupancy Prediction in Autonomous Driving

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

Learning-based 3D Occupancy Prediction for Autonomous Navigation in Occluded Environments