Abstract: In this technical report, we present our solution, named UniOcc, for the Vision-Centric 3D occupancy prediction track in the nuScenes Open Dataset Challenge at CVPR 2023. Existing methods for occupancy prediction primarily focus on optimizing projected features in 3D volume space using 3D occupancy labels. However, generating these labels is complex and expensive (it relies on 3D semantic annotations), and because the labels are limited by voxel resolution, they cannot provide fine-grained spatial semantics. To address this limitation, we propose a novel Unifying Occupancy (UniOcc) prediction method that explicitly imposes spatial geometry constraints and complements them with fine-grained semantic supervision through volume ray rendering. Our method significantly enhances model performance and demonstrates promising potential for reducing human annotation costs. Given the laborious nature of annotating 3D occupancy, we further introduce a Depth-aware Teacher Student (DTS) framework to enhance prediction accuracy using unlabeled data. Our solution achieves 51.27% mIoU on the official leaderboard with a single model, placing 3rd in this challenge.
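To make the idea of supervision through volume ray rendering more concrete, the following is a minimal sketch of the standard volume rendering accumulation along a single camera ray. It is not taken from the report: the function name `render_ray`, its arguments, and the sampling scheme are assumptions chosen for illustration, and the report's actual rendering pipeline may differ. The sketch composites per-sample densities into a rendered depth and a rendered class distribution, which could then be compared against 2D depth and segmentation labels.

```python
import math


def render_ray(densities, deltas, depths, class_probs):
    """Composite samples along one ray using standard volume rendering weights.

    All names here are hypothetical. `densities[i]` is the predicted density at
    sample i, `deltas[i]` the distance covered by that sample, `depths[i]` its
    distance from the camera, and `class_probs[i]` its per-class probabilities.
    """
    transmittance = 1.0  # probability that the ray has not been stopped yet
    weights = []
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha

    # Expected depth and expected class distribution under the ray weights.
    depth = sum(w * d for w, d in zip(weights, depths))
    num_classes = len(class_probs[0])
    semantics = [
        sum(w * probs[c] for w, probs in zip(weights, class_probs))
        for c in range(num_classes)
    ]
    return depth, semantics, weights
```

In such a setup, the rendered depth would carry the geometric constraint and the rendered class distribution would carry the semantic one, so both losses can be computed in 2D image space.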

What problem does this paper attempt to address?

This paper attempts to address the problem of how to accurately predict occupancy and semantic information in 3D space using multi-view images in autonomous driving scene perception. Specifically, existing methods mainly rely on 3D occupancy labels to optimize features in the 3D volumetric space, but the generation process of these labels is complex and expensive, and limited by voxel resolution, they cannot provide fine-grained spatial semantic information. To solve these problems, the authors propose a new method called UniOcc, which combines geometric constraints and fine-grained semantic supervision through volumetric rendering techniques, significantly improving model performance and demonstrating the potential to reduce manual annotation costs. Additionally, to further improve prediction accuracy, the authors introduce a Depth-aware Teacher Student (DTS) framework, which utilizes unlabeled data for self-supervised training.

### Main Contributions:

1. **Unified Occupancy Prediction Method**: By combining geometric constraints and fine-grained semantic supervision through volumetric rendering techniques, model performance is improved.
2. **Reduced Dependence on Expensive 3D Annotations**: The model can be trained using only 2D segmentation labels, achieving or even surpassing the performance of models using 3D annotations.
3. **Depth-aware Teacher Student Framework**: Utilizes unlabeled data for self-supervised training, further enhancing the accuracy of model predictions.

### Experimental Results:

- On the official leaderboard, the single model achieved 51.27% mIoU, ranking third.
- Through a series of improvements, such as using visibility masks, stronger pre-trained models, increasing voxel resolution, and test-time augmentation, the final model achieved 52.1% mIoU on the validation set.
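For readers unfamiliar with the metric, the sketch below shows one way mean intersection-over-union (mIoU) can be computed over flattened voxel labels, optionally restricted by a visibility mask. The function name `mean_iou`, the flat-list input format, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration; the challenge's official evaluation protocol is not described in the text above and may differ in these details.

```python
def mean_iou(pred, target, num_classes, mask=None):
    """Mean IoU over classes for flat lists of integer voxel labels.

    `mask`, if given, is a list of booleans; voxels with a False entry
    (for example, voxels not visible from any camera) are ignored.
    """
    ious = []
    for c in range(num_classes):
        intersection = 0
        union = 0
        for i, (p, t) in enumerate(zip(pred, target)):
            if mask is not None and not mask[i]:
                continue  # skip voxels excluded by the mask
            intersection += int(p == c and t == c)
            union += int(p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Calling `mean_iou(pred, target, num_classes, mask=visible)` with a hypothetical boolean list `visible` scores only the voxels marked as visible.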
### Summary:

This paper proposes a new 3D occupancy prediction method that effectively addresses the high annotation cost and lack of fine-grained semantic information in existing methods through volumetric rendering techniques and a self-supervised learning framework, providing a new solution for autonomous driving scene perception.
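The text above does not say how the teacher in the Depth-aware Teacher Student framework is obtained or updated. As a purely illustrative sketch, one common choice in teacher-student training is to keep the teacher as an exponential moving average (EMA) of the student weights, in the style of mean-teacher methods. The function `ema_update`, the dictionary-of-floats parameter format, and the momentum value are all assumptions and should not be read as the authors' implementation.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher parameters as an EMA of the student parameters.

    Assumption: parameters are plain dicts mapping names to floats. A real
    model would hold tensors, but the update rule has the same form.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

Under this assumption the teacher changes slowly, so the pseudo-labels it produces on unlabeled data are more stable than those of the rapidly changing student.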

UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering

UniVision: A Unified Framework for Vision-Centric 3D Perception

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

MonoOcc: Digging into Monocular Semantic Occupancy Prediction

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

$α$-OCC: Uncertainty-Aware Camera-based 3D Semantic Occupancy Prediction

Monocular Occupancy Prediction for Scalable Indoor Scenes

Multi-Scale Occ: 4th Place Solution for CVPR 2023 3D Occupancy Prediction Challenge

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Learning Occupancy for Monocular 3D Object Detection