RenderWorld: World Model with Self-Supervised 3D Label

Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma

2024-09-18

Abstract: Vision-only end-to-end autonomous driving is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves more fine-grained scene element representation, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with an autoregressive world model.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The main problem this paper attempts to address is the perception, prediction, and planning issues in pure vision-based autonomous driving systems. Specifically, the authors propose an end-to-end autonomous driving framework called RenderWorld, which aims to enhance system performance through the following points:

1. **Pure Vision Perception**: Compared to traditional LiDAR and vision fusion methods, pure vision methods are not only more cost-effective but also more reliable. RenderWorld achieves pure vision perception by generating 3D occupancy labels through a self-supervised Gaussian-based Img2Occ module.
2. **High-Precision 3D Scene Representation**: RenderWorld uses Gaussian Splatting to represent 3D scenes and render 2D images, significantly improving segmentation accuracy while reducing GPU memory consumption.
3. **Fine-Grained Scene Element Representation**: By introducing the Air Mask Variational Autoencoder (AM-VAE), RenderWorld can separately encode air and non-air voxels, achieving finer-grained scene element representation and enhancing the performance of 4D occupancy prediction and motion planning.
4. **Efficient 4D Occupancy Prediction and Motion Planning**: RenderWorld utilizes a world model for future scene prediction and vehicle decision-making. It generates high-dimensional scene tokens in a self-supervised manner, achieving accurate autoregressive prediction and vehicle path planning.

In summary, RenderWorld aims to enhance the perception, prediction, and planning capabilities of autonomous driving systems through pure vision input, combined with efficient 3D scene representation and fine-grained scene element encoding.
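
The idea of treating air and non-air voxels separately can be illustrated with a small sketch. This is not the AM-VAE itself: it only shows how a flattened occupancy grid might be split into an air mask and a stream of non-air labels, and rebuilt again. The class id `AIR = 0` and the helper names `split_air` and `merge_air` are assumptions made for illustration and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical class id for empty ("air") voxels; the paper's label map may differ.
AIR = 0


def split_air(occ: List[int], air_id: int = AIR) -> Tuple[List[bool], List[int]]:
    """Split a flattened occupancy grid into an air mask and the non-air labels."""
    air_mask = [label == air_id for label in occ]
    non_air = [label for label in occ if label != air_id]
    return air_mask, non_air


def merge_air(air_mask: List[bool], non_air: List[int], air_id: int = AIR) -> List[int]:
    """Inverse of split_air: rebuild the full grid from the two streams."""
    labels = iter(non_air)
    return [air_id if is_air else next(labels) for is_air in air_mask]


# Toy grid of 8 voxels with made-up semantic class ids.
occ = [0, 0, 3, 0, 7, 7, 0, 1]
air_mask, non_air = split_air(occ)
restored = merge_air(air_mask, non_air)
```

In a setup like the one summarized above, each stream could then be passed to its own encoder, so that the many empty voxels do not dominate the representation of occupied ones. That downstream step is an assumption here and is not shown.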

Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Enhancing End-to-End Autonomous Driving with Latent World Model

Large-Scale Neural Scene Disentanglement Approach for Self-Driving Simulation

UnO: Unsupervised Occupancy Fields for Perception and Forecasting

ADriver-I: A General World Model for Autonomous Driving

MUVO: A Multimodal World Model with Spatial Representations for Autonomous Driving

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

UniWorld: Autonomous Driving Pre-training via World Models

GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Vehicle Perception from a Single Image for Autonomous Driving Using Deformable Model Representation and Deep Learning

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability