SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, Jiwen Lu
2023-11-29
Abstract: 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
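The abstract describes treating the 3D representation as a signed distance field (SDF) and rendering from it. The sketch below shows one common way to do this along a single ray: a NeuS-style conversion from SDF samples to rendering weights, followed by an expected depth. It is an illustration under stated assumptions, not the paper's implementation; the function names, the `sharpness` parameter, and the toy ray are hypothetical, and SelfOcc's exact formulation may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def render_depth_from_sdf(depths, sdf_values, sharpness=10.0, eps=1e-8):
    """Convert SDF samples along one ray into rendering weights and a depth.

    depths     -- increasing sample depths t_0 < ... < t_n along the ray
    sdf_values -- SDF at those samples (positive in free space, negative inside)
    sharpness  -- assumed scale: larger values concentrate weight at the surface
    """
    cdf = [sigmoid(sharpness * s) for s in sdf_values]
    weights, mids = [], []
    transmittance = 1.0  # probability the ray has not been stopped yet
    for i in range(len(depths) - 1):
        # Opacity of interval [t_i, t_{i+1}]: relative drop of the sigmoid, clipped at 0.
        alpha = max((cdf[i] - cdf[i + 1]) / (cdf[i] + eps), 0.0)
        weights.append(transmittance * alpha)
        mids.append(0.5 * (depths[i] + depths[i + 1]))
        transmittance *= 1.0 - alpha
    # Expected depth: weighted sum of interval midpoints.
    depth = sum(w * t for w, t in zip(weights, mids))
    return weights, depth


# Toy ray (assumption): a flat surface 5 units away, so sdf(t) = 5 - t.
depths = [float(t) for t in range(11)]
sdf_values = [5.0 - t for t in depths]
weights, depth = render_depth_from_sdf(depths, sdf_values)
```

In this toy case the weights concentrate on the two intervals adjacent to the zero crossing of the SDF, so the rendered depth lands close to the surface. Because every step is differentiable, a photometric or depth loss on the rendered output can propagate back to the SDF values.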
What problem does this paper attempt to address?
This paper asks how to learn and predict the 3D occupancy of a scene without 3D annotations, using only video sequences as supervision. It proposes SelfOcc, a self-supervised method that learns 3D occupancy representations from video sequences for autonomous driving.

### Problem Background

3D occupancy prediction is an important task in autonomous driving. Its goal is to predict whether each point in the surrounding 3D space is occupied. Existing methods usually rely on 3D annotations (such as LiDAR point clouds or dense occupancy labels) to produce meaningful results. Acquiring these annotations is time-consuming and costly, which makes them difficult to scale.

### Core Problems of the Paper

The central question is how to achieve 3D occupancy prediction through self-supervised learning from video sequences alone. The paper targets the following key challenges:

1. **No 3D-Annotated Data**: How to learn 3D occupancy representations from the spatio-temporal consistency of video sequences alone, without 3D annotations.
2. **Multi-View Consistency**: How to use the multi-view information in video sequences to improve the accuracy of 3D occupancy prediction.
3. **Geometric Reasoning**: How to ensure that the predicted 3D occupancy has a reasonable geometric structure, especially in complex scenes.

### Solution Overview

To address these challenges, the paper proposes the SelfOcc model, which has the following main components:

1. **Conversion from Image to 3D Representation**: 2D image features are lifted into 3D space (for example, Bird's-Eye View (BEV) or Tri-Perspective View (TPV)), which enables feature interaction in 3D and avoids ambiguities caused by multiple cameras.
2. **Introduction of Self-Supervised Signals**: The method exploits the temporal consistency of video sequences and projects the 3D occupancy prediction back to 2D views through differentiable volume rendering, which provides the self-supervision signal.
3. **MVS-Embedded Strategy**: A Multi-View Stereo (MVS)-embedded strategy directly optimizes the weights induced by the Signed Distance Field (SDF) using multiple depth proposals, improving the accuracy of depth prediction.
4. **Regularization and Constraints**: Regularizers such as a Hessian loss and an Eikonal term keep the SDF smooth and consistent with its physical meaning as a distance function.

### Experimental Verification

The paper evaluates SelfOcc on multiple public datasets, including SemanticKITTI and nuScenes, on tasks such as 3D occupancy prediction and depth synthesis. It reports that SelfOcc outperforms the previous best method, SceneRF, by 58.7% on SemanticKITTI with a single input frame, and that it is the first self-supervised work to produce reasonable 3D occupancy for surround cameras on nuScenes.

### Summary

SelfOcc addresses 3D occupancy prediction in the absence of 3D-annotated data, offering a new approach to 3D perception in autonomous driving.
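The Eikonal term mentioned under Regularization and Constraints encourages the SDF gradient to have unit norm everywhere, which is what makes the field behave like a true distance function. The sketch below is a generic finite-difference version of that penalty on an analytic field. It is only an illustration: the function names, sample points, and step size `h` are assumptions, and a learned model would more likely compute the gradient with automatic differentiation.

```python
import math


def eikonal_loss(sdf, points, h=1e-3):
    """Mean of (||grad sdf|| - 1)^2 over the sample points.

    sdf    -- callable (x, y, z) -> signed distance
    points -- iterable of (x, y, z) tuples at which to evaluate the penalty
    h      -- assumed step size for central finite differences
    """
    total = 0.0
    for x, y, z in points:
        gx = (sdf(x + h, y, z) - sdf(x - h, y, z)) / (2.0 * h)
        gy = (sdf(x, y + h, z) - sdf(x, y - h, z)) / (2.0 * h)
        gz = (sdf(x, y, z + h) - sdf(x, y, z - h)) / (2.0 * h)
        grad_norm = math.sqrt(gx * gx + gy * gy + gz * gz)
        total += (grad_norm - 1.0) ** 2
    return total / len(points)


def sphere_sdf(x, y, z):
    """Exact SDF of a unit sphere: its gradient has norm 1 away from the origin."""
    return math.sqrt(x * x + y * y + z * z) - 1.0


def scaled_field(x, y, z):
    """Same zero level set, but not a valid SDF: the gradient norm is 2."""
    return 2.0 * sphere_sdf(x, y, z)


# Hypothetical sample points, chosen away from the origin where the gradient is undefined.
points = [(0.5, 0.2, -0.3), (1.5, -1.0, 0.7), (-2.0, 0.4, 1.1)]
loss_valid = eikonal_loss(sphere_sdf, points)
loss_scaled = eikonal_loss(scaled_field, points)
```

Both fields describe the same surface, yet only the first has a near-zero penalty; the scaled field is penalized with a loss close to 1 because its gradient norm is 2 everywhere. This is why the term constrains the field's values away from the surface and not just the location of the surface.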