Abstract: Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
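The abstract reports results as Chamfer distance between predicted and observed point clouds. As a rough illustration of what that metric measures, here is a minimal sketch of one common symmetric form (mean nearest-neighbour Euclidean distance, averaged over both directions). The exact convention used in the paper's evaluation (squared vs. unsquared distances, sums vs. means) is not stated here, so the function name `chamfer_distance` and this particular definition are assumptions for illustration only.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance in each
    direction, averaged over the two directions. Other conventions exist.
    """
    if not pred or not target:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # For every point in src, take the distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(pred, target) + one_way(target, pred))


# Toy 3D point clouds (made-up values, not data from the paper).
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, target))  # 0.5
```

This brute-force version compares every pair of points, so it is only practical for small clouds; real evaluations typically use a spatial index or a batched GPU implementation.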

What problem does this paper attempt to address?

This paper asks how to build an unsupervised world model for autonomous driving that directly predicts future observations, such as point clouds. It observes that current autonomous driving prediction systems still rely on supervised learning, for example methods built on bounding boxes, semantic segmentation, or instance segmentation. The motivating argument is that a world model able to predict future unlabeled observations accurately must already capture the scene's geometric structure and dynamics. The paper therefore proposes Copilot4D, which predicts future point cloud observations with a discrete diffusion model and so learns without labels.

The paper targets two bottlenecks:

1. **Complex observation space**: Observations in autonomous driving, such as point clouds, are complex and unstructured, which makes it hard to choose a loss function and to build a generative model that captures meaningful likelihoods. The paper addresses this by tokenizing the observations with VQ-VAE, converting them into discrete representations.
2. **Scalable generative model**: Generative models such as language models work well in natural language processing but have scaled less readily to applications like autonomous driving. Because each observation carries a large amount of data, decoding and denoising need to happen in parallel. The paper recasts the Masked Generative Image Transformer (MaskGIT) as a discrete diffusion model and adds a few simple changes to make parallel decoding and denoising efficient.

With these components, Copilot4D reduces the prior state-of-the-art Chamfer distance by more than 65% for 1-second prediction and more than 50% for 3-second prediction across the NuScenes, KITTI Odometry, and Argoverse2 datasets.
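The two ideas in the summary above, tokenizing observations against a codebook and then filling in tokens in parallel, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `quantize` shows the nearest-codebook lookup at the heart of VQ-VAE tokenization, and `parallel_decode` shows a basic MaskGIT-style loop that commits the most confident predictions each round under a cosine schedule. The names `MASK`, `quantize`, `parallel_decode`, and `toy_predict` are hypothetical, the predictor is a stand-in for a trained network, and the paper's own modifications to MaskGIT are not reproduced.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(f, codebook[k]))
        for f in features
    ]


def parallel_decode(predict, length, steps=4):
    """Fill `length` masked tokens over at most `steps` rounds.

    `predict(tokens)` stands in for a trained network and returns one
    (token_id, confidence) pair per position.
    """
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = predict(tokens)
        # Cosine schedule (assumed): how many tokens stay masked after this round.
        keep_masked = int(length * math.cos(0.5 * math.pi * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses together; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


def toy_predict(tokens):
    """Hypothetical predictor: proposes i % 3 with a pseudo-random confidence."""
    rng = random.Random(0)
    return [(i % 3, rng.random()) for i in range(len(tokens))]


# Made-up 2D features and codebook, purely for illustration.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
features = [(0.1, -0.2), (3.5, 4.2), (0.9, 1.3)]
print(quantize(features, codebook))     # [0, 2, 1]
print(parallel_decode(toy_predict, 8))  # [0, 1, 2, 0, 1, 2, 0, 1]
```

The point of the loop is that the number of network calls is set by `steps`, not by the number of tokens, which is what makes this style of decoding attractive when each observation yields many tokens.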

Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

UniWorld: Autonomous Driving Pre-training via World Models

An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training

Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

UnO: Unsupervised Occupancy Fields for Perception and Forecasting

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Enhanced Multimodal Trajectory Prediction for Autonomous Vehicles Using Advanced Diffusion Model Techniques

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey

WcDT: World-centric Diffusion Transformer for Traffic Scene Generation

MUVO: A Multimodal World Model with Spatial Representations for Autonomous Driving

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

ADriver-I: A General World Model for Autonomous Driving