Doe-1: Closed-Loop Autonomous Driving with Large World Model

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
2024-12-13
Abstract: End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code is available at https://github.com/wzzheng/Doe.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses several key issues in existing autonomous driving methods that limit the performance and reliability of autonomous driving systems. Specifically, it focuses on three problems:

1. **Weak Scalability**: Most existing autonomous driving methods rely on manually designed scene representations and descriptions. These representations struggle to provide comprehensive information as the amount of data grows, which limits performance gains on downstream tasks.
2. **Lack of High-Order Interaction**: Existing autonomous driving pipelines usually predict multiple future scenes at once without considering how the ego vehicle's actions affect the way the environment evolves. This is unrealistic in interactive driving scenarios, where the behavior of other vehicles depends heavily on the ego vehicle's actions.
3. **Inefficient Decision-Making**: Existing methods predict and plan several steps ahead, but in actual driving the model usually executes only the next action and then replans from new observations. Most of the multi-step plan is therefore redundant.

To solve these problems, the paper proposes a new closed-loop autonomous driving paradigm and introduces a large driving world model (Doe-1). Doe-1 treats autonomous driving as an autoregressive world-evolution problem over multi-modal states and predicts future scene evolution directly from the observation space through self-attention, without intermediate scene representations. Doe-1 can also condition its predictions on the ego vehicle's actions. It produces multi-step futures through autoregressive generation, but plans only the immediate action at each step.

### Main Contributions

1. **Unified Multi-modal Autoregressive Model**: Doe-1 uses a single multi-modal autoregressive model to handle perception, prediction, and planning, eliminating intermediate scene representations and improving the model's scalability and performance.
2. **High-Order Interaction Modeling**: Doe-1 accounts for the impact of the ego vehicle's actions when predicting future scenes, so it models the dynamics of interactive driving scenarios more accurately.
3. **Efficient Decision-Making Mechanism**: Doe-1 produces multi-step futures through autoregressive generation but plans only the immediate action at each step, avoiding the inefficiency and redundancy of multi-step planning.

### Technical Details

- **Multi-modal Representations of Observation, Description, and Action**: Doe-1 encodes observations (images), descriptions (texts), and actions (displacements) into separate discrete token sequences, and then models them autoregressively with a Transformer.
- **Autoregressive Generation**: Doe-1 generates one token at a time, producing future observations, descriptions, and actions step by step.
- **Multi-task Applications**: With different prompt settings, Doe-1 can be applied to multiple tasks, including visual question answering, action-conditioned video generation, and end-to-end motion planning.

### Experimental Verification

The paper conducts experiments on the widely used nuScenes dataset to verify the performance of Doe-1 on various driving-related tasks. Although it uses only monocular camera input and high-level question-answering supervision, Doe-1 still shows satisfactory performance.
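The two ideas under Technical Details, encoding displacements as discrete tokens and generating descriptions, actions, and observations in turn, can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform binning scheme, the constants `NUM_BINS` and `MAX_DISP`, and the `generate` callable standing in for the transformer are all assumptions made for illustration, since the summary does not specify how the position-aware tokenizer works internally.

```python
from typing import Callable, List, Tuple

# Hypothetical settings chosen for illustration; not taken from the paper.
NUM_BINS = 256    # number of bins per displacement axis
MAX_DISP = 10.0   # metres; values outside [-MAX_DISP, MAX_DISP] are clipped

Token = Tuple[str, int]  # (modality, token id)


def encode_displacement(dx: float, dy: float) -> Tuple[int, int]:
    """Map a continuous 2-D displacement to a pair of bin indices."""
    def to_bin(v: float) -> int:
        v = max(-MAX_DISP, min(MAX_DISP, v))            # clip to the valid range
        frac = (v + MAX_DISP) / (2 * MAX_DISP)          # rescale to [0, 1]
        return min(NUM_BINS - 1, int(frac * NUM_BINS))  # upper edge goes to last bin
    return to_bin(dx), to_bin(dy)


def decode_displacement(tx: int, ty: int) -> Tuple[float, float]:
    """Map bin indices back to the displacement at each bin centre."""
    width = 2 * MAX_DISP / NUM_BINS
    def to_val(t: int) -> float:
        return -MAX_DISP + (t + 0.5) * width
    return to_val(tx), to_val(ty)


def rollout(
    generate: Callable[[List[Token], str], List[Token]],
    first_obs: List[Token],
    steps: int,
) -> List[Token]:
    """Closed-loop rollout: description -> action -> next observation, repeated.

    `generate(context, modality)` is a hypothetical stand-in for the
    autoregressive model; it returns the tokens of the requested modality
    conditioned on everything generated so far.
    """
    context = list(first_obs)
    for _ in range(steps):
        for modality in ("description", "action", "observation"):
            context += generate(context, modality)
    return context
```

Because each action is generated after the current description and before the next observation, the predicted future is conditioned on the ego vehicle's own action, and only one action is planned per step. Quantization error in this sketch is bounded by half a bin width, `MAX_DISP / NUM_BINS`.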
In conclusion, by proposing Doe-1, the paper aims to address weak scalability, the lack of high-order interaction, and inefficient decision-making in existing autonomous driving methods, with the goal of a more reliable and efficient autonomous driving system.