Abstract: The advancement of autonomous driving is increasingly reliant on high-quality annotated datasets, especially in the task of 3D occupancy prediction, where the occupancy labels require dense 3D annotation with significant human effort. In this paper, we propose SyntheOcc, a diffusion model that Synthesizes photorealistic and geometric-controlled images by conditioning on Occupancy labels in driving scenarios. This yields an unlimited amount of diverse, annotated, and controllable datasets for applications like training perception models and simulation. SyntheOcc addresses the critical challenge of how to efficiently encode 3D geometric information as conditional input to a 2D diffusion model. Our approach incorporates 3D semantic multi-plane images (MPIs) to provide comprehensive and spatially aligned 3D scene descriptions for conditioning. As a result, SyntheOcc can generate photorealistic multi-view images and videos that faithfully align with the given geometric labels (semantics in 3D voxel space). Extensive qualitative and quantitative evaluations of SyntheOcc on the nuScenes dataset demonstrate its effectiveness in generating controllable occupancy datasets that serve as an effective data augmentation for perception models.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: in the field of autonomous driving, how to efficiently generate high-quality annotated datasets with 3D occupancy labels. Specifically, the paper proposes a method named SyntheOcc, aiming to synthesize realistic and controllable street-view images and videos by conditioning 3D geometric information. This not only helps to reduce the workload of manual annotation, but also provides a large amount of diverse annotated data for training perception models and simulation.

### Main Problem Background

1. **Requirement for High-Quality Annotated Data**:
   - The development of autonomous driving technology depends on high-precision annotated datasets. Especially in 3D occupancy prediction tasks, dense 3D annotations are required, which usually consumes a great deal of manpower.
2. **Limitations of Existing Methods**:
   - Although existing generative models can generate realistic images, they have limitations in 3D geometric control. Especially when dealing with complex 3D scenes, it is difficult to achieve fine-grained geometric control.
   - For example, methods such as BEVGen can only generate street-view images based on BEV layouts, and cannot precisely edit 3D voxels, thus limiting their application in generating complex scenes.

### SyntheOcc Solution

SyntheOcc proposes an innovative method. By introducing 3D semantic multi-plane images (MPIs) to represent 3D scenes and using them as conditional inputs to the diffusion model, more fine-grained 3D geometric control is achieved. Specific contributions include:

1. **3D Semantic Multi-Plane Images (MPIs)**:
   - Use MPIs to represent occupancy information in 3D scenes. Each plane represents semantic labels at a specific depth. This method not only retains accurate 3D information but also ensures spatial alignment with the generated images.
2. **MPI Encoder**:
   - Design an MPI encoder to convert MPI features into latent-space features suitable for the diffusion model, thereby improving the quality and recognizability of the generated images.
3. **Re-weighting Strategy**:
   - Introduce multiple re-weighting methods (such as progressive foreground enhancement, depth-aware re-weighting, and class-balanced sampling) to deal with the class-imbalance problem and long-tail distribution problem during the training process.

### Application Scenarios

- **Dataset Generation**: SyntheOcc can generate a large number of annotated street-view images and videos for training perception models.
- **Rare-Scene Generation**: Users can create rare scenes (such as traffic cones blocking the road) by editing 3D voxels, thereby evaluating the robustness of the autonomous driving system.
- **Perception-Model Improvement**: Experimental results show that the synthetic data generated by SyntheOcc has a good effect in 3D occupancy prediction tasks and can effectively improve the performance of perception models.

In conclusion, SyntheOcc solves the deficiencies of existing methods in 3D geometric control, provides a new way to efficiently generate high-quality annotated data, and promotes the development of autonomous driving technology.
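
To make the idea of a semantic multi-plane image more concrete, the following is a minimal NumPy sketch of how a labeled voxel grid could be resampled into per-depth planes that are pixel-aligned with a camera. It is not the paper's implementation: the function name `semantic_mpi`, the pinhole camera model, the nearest-voxel lookup, and the convention that label 0 means "empty" are all assumptions made for illustration.

```python
import numpy as np

def semantic_mpi(voxels, voxel_size, origin, K, cam_to_world, depths, height, width):
    """Resample a semantic voxel grid into fronto-parallel planes (hypothetical helper).

    voxels:       (X, Y, Z) integer labels; 0 is assumed to mean "empty".
    voxel_size:   voxel edge length in meters.
    origin:       (3,) world position of the corner of voxel [0, 0, 0].
    K:            (3, 3) pinhole intrinsics (assumed camera model).
    cam_to_world: (4, 4) camera pose.
    depths:       (D,) plane depths along the camera z axis.
    Returns a (D, height, width) array of semantic labels.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to camera-frame rays with z = 1.
    rays = pix @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    grid_shape = np.array(voxels.shape)

    mpi = np.zeros((len(depths), height, width), dtype=voxels.dtype)
    for i, d in enumerate(depths):
        pts_world = (rays * d) @ R.T + t                    # points on plane z = d
        idx = np.floor((pts_world - origin) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < grid_shape), axis=-1)
        ix, iy, iz = idx[inside].T
        mpi[i][inside] = voxels[ix, iy, iz]                 # nearest-voxel label
    return mpi

# Toy scene: a single voxel with label 7 in front of an identity-pose camera.
voxels = np.zeros((10, 10, 10), dtype=np.int64)
voxels[5, 5, 3] = 7
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
mpi = semantic_mpi(voxels, 1.0, np.array([-5.0, -5.0, 0.0]), K,
                   np.eye(4), depths=[1.5, 3.5], height=4, width=4)
print(mpi.shape)
```

Because every plane shares the image's pixel grid, the stacked planes can be fed to a 2D network as extra channels, which is one way to read the summary's point about spatial alignment with the generated images.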

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs
