RenderWorld: World Model with Self-Supervised 3D Label

Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma
2024-09-18
Abstract: End-to-end autonomous driving with vision-only input is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air voxels separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with an autoregressive world model.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to address is perception, prediction, and planning in pure vision-based autonomous driving systems. Specifically, the authors propose an end-to-end autonomous driving framework called RenderWorld, which aims to enhance system performance through the following points:

1. **Pure Vision Perception**: Compared to traditional LiDAR and vision fusion methods, pure vision methods are not only more cost-effective but also more reliable. RenderWorld achieves pure vision perception by generating 3D occupancy labels through a self-supervised Gaussian-based Img2Occ module.
2. **High-Precision 3D Scene Representation**: RenderWorld uses Gaussian Splatting to represent 3D scenes and render 2D images, significantly improving segmentation accuracy while reducing GPU memory consumption.
3. **Fine-Grained Scene Element Representation**: By introducing the Air Mask Variational Autoencoder (AM-VAE), RenderWorld can separately encode air and non-air voxels, achieving finer-grained scene element representation and enhancing the performance of 4D occupancy prediction and motion planning.
4. **Efficient 4D Occupancy Prediction and Motion Planning**: RenderWorld utilizes a world model for future scene prediction and vehicle decision-making. It generates high-dimensional scene tokens in a self-supervised manner, achieving accurate autoregressive prediction and vehicle path planning.

In summary, RenderWorld aims to enhance the perception, prediction, and planning capabilities of autonomous driving systems through pure vision input, combined with efficient 3D scene representation and fine-grained scene element encoding.
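
The idea of handling air and non-air voxels separately can be illustrated with a minimal sketch. This is not the paper's AM-VAE, which uses learned encoders; it only shows the data split that such a model could operate on. The class id `AIR = 0`, the flattened voxel list, and the helper names `split_air` and `merge_air` are all assumptions made for illustration.

```python
AIR = 0  # hypothetical class id for empty ("air") voxels


def split_air(voxels, air_id=AIR):
    """Split a flattened semantic occupancy grid into two branches.

    Returns a boolean air mask and a non-air label list in which
    air positions are replaced by None (to be ignored downstream).
    """
    air_mask = [v == air_id for v in voxels]
    non_air = [None if is_air else v for v, is_air in zip(voxels, air_mask)]
    return air_mask, non_air


def merge_air(air_mask, non_air, air_id=AIR):
    """Recombine the air mask and non-air labels into one grid."""
    return [air_id if is_air else v for is_air, v in zip(air_mask, non_air)]


# Toy flattened grid: 0 is air, other integers are semantic classes.
voxels = [0, 3, 0, 0, 5, 0, 0, 7]
air_mask, non_air = split_air(voxels)
restored = merge_air(air_mask, non_air)
```

In this sketch the split is lossless, so `merge_air` recovers the original grid. In a learned setting, each branch would presumably be encoded and decoded by its own network before recombination.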