Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun
2024-04-01
Abstract: Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Robotics
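The abstract reports its results as reductions in Chamfer distance between predicted and observed point clouds. As a rough illustration of what that metric measures, here is a minimal sketch using only the Python standard library. It assumes one common convention (unsquared Euclidean distances, averaged in each direction, then averaged over both directions); the exact variant used in the paper is not stated here, and the function name `chamfer_distance` is illustrative.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbor Euclidean distance in each
    direction, averaged over the two directions. Other variants use squared
    distances or sums instead of means.
    """
    if not pred or not target:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # For each point in src, find the distance to its closest point in dst,
        # then average those distances over src.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(pred, target) + mean_nearest(target, pred))


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0)]
print(chamfer_distance(predicted, observed))
```

This brute-force version compares every pair of points, so it is only suitable for small inputs. A lower value means the predicted cloud lies closer to the observed one.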
What problem does this paper attempt to address?
The paper asks how to build an unsupervised world model for autonomous driving that directly predicts future observations such as point clouds. It points out that current prediction systems for autonomous driving still rely on supervised learning, for example methods built on bounding boxes, semantic segmentation, or instance segmentation. If a world model can accurately predict future unlabeled observations, however, it must already understand the scene's geometry and dynamics. The paper therefore proposes Copilot4D, which predicts future point cloud observations with a discrete diffusion model and so learns without labels.

The paper addresses two bottlenecks:

1. **Complex observation space**: Observations in autonomous driving, such as point clouds, are complex and unstructured. Choosing an appropriate loss function and building a generative model that captures meaningful likelihoods over them is difficult. The paper therefore tokenizes the input with a VQ-VAE, converting it into a discrete representation.
2. **Scalable generative model**: Generative models such as GPT-style language models scale well on text, but scaling them to applications such as autonomous driving has been slower. Because the volume of observation data is large, tokens must be decoded and denoised in parallel. The paper recasts the Masked Generative Image Transformer (MaskGIT) as a discrete diffusion model and adds a few simple changes to make parallel decoding and denoising efficient.

With these methods, Copilot4D reduces the prior state-of-the-art Chamfer distance by more than 65% for 1-second prediction and by more than 50% for 3-second prediction across the NuScenes, KITTI Odometry, and Argoverse2 datasets.
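The two ideas named above, tokenizing observations against a codebook and decoding many tokens in parallel, can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: `quantize` shows only the nearest-codebook lookup at the core of a VQ-VAE (no encoder, decoder, or training), and `parallel_decode` follows the general MaskGIT recipe with a cosine masking schedule rather than Copilot4D's modified discrete diffusion procedure. The `MASK` id and the `predict` callback are hypothetical stand-ins for a trained model.

```python
import math

MASK = -1  # hypothetical id marking a slot that has not been decoded yet


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(f, codebook[k]))
        for f in features
    ]


def parallel_decode(predict, length, steps):
    """MaskGIT-style decoding sketch.

    Start with every slot masked. At each step, predict all slots at once,
    commit the most confident predictions among the masked slots, and leave
    the rest masked for the next step.
    `predict(tokens)` must return one (token_id, confidence) pair per slot.
    """
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = predict(tokens)
        # Cosine schedule: how many slots should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the masked slots with the highest confidence.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


# Toy usage: a two-entry codebook and a fixed, model-free predictor.
codebook = [(0.0, 0.0), (1.0, 1.0)]
print(quantize([(0.1, -0.1), (0.9, 1.2)], codebook))


def toy_predict(tokens):
    # Stand-in for a trained network: token i % 3, with earlier slots more confident.
    return [(i % 3, 1.0 / (i + 1)) for i in range(len(tokens))]


print(parallel_decode(toy_predict, length=8, steps=4))
```

The point of the schedule is that the number of network calls is set by `steps`, not by the sequence length, which is what makes this style of decoding attractive when each observation produces many tokens.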