DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai
2023-12-25
Abstract: Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of LLMs in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform closed-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between language decisions and vehicle control commands by standardizing the decision states according to an off-the-shelf motion planning module. (2) We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system; it takes driving rules, user commands, and inputs from various sensors (e.g., camera, lidar) as input, makes driving decisions, and provides explanations. This model can be plugged into existing AD systems such as Apollo for closed-loop driving. (3) We design an effective data engine to collect a dataset that includes decision states and corresponding explanation annotations for model training and evaluation. We conduct extensive experiments and show that our model achieves a driving score of 76.1 on the CARLA Town05 Long benchmark, surpassing the Apollo baseline by 4.7 points under the same settings, which demonstrates the effectiveness of our model. We hope this work can serve as a baseline for autonomous driving with LLMs. Code and models shall be released at https://github.com/OpenGVLab/DriveMLM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper explores the potential of Large Language Models (LLMs) in Autonomous Driving (AD), specifically by aligning Multimodal Large Language Models (MLLMs) with behavioral planning states to achieve closed-loop autonomous driving. The research team proposes a framework named DriveMLM, which enables LLM-based AD systems to perform closed-loop driving tasks in a realistic simulator. To achieve this goal, the paper focuses on the following three aspects:

1. **Behavioral Planning State Alignment**: The researchers analyzed the decision states of the behavioral planning module in the mature Apollo autonomous driving system and standardized them so that LLMs can output them directly. This allows LLM outputs to be transformed into vehicle control signals, so the model integrates with existing AD systems.
2. **Multimodal LLM (MLLM) Planner Design**: An MLLM planner was developed that receives multimodal inputs, including multi-angle images, LiDAR point clouds, traffic rules, and user instructions, and predicts driving decisions. The model also explains its decisions, which improves transparency and interpretability.
3. **Efficient Data Engine**: A data collection strategy was designed to generate datasets containing decision states and corresponding explanatory annotations to support model training and evaluation. The research team manually collected 280 hours of driving data on the CARLA simulator and converted it into decision states and explanatory annotations for model training.

Experimental results show that DriveMLM achieves a driving score of 76.1 on the CARLA Town05 Long benchmark, 4.7 points higher than the Apollo baseline under the same settings, demonstrating the model's effectiveness.
Moreover, without changing the structure of the existing AD system, the model can adjust its driving preferences through language instructions, for example yielding to ambulances or ignoring red lights, which demonstrates its flexibility and adaptability. In summary, DriveMLM not only bridges the gap between LLMs and closed-loop driving but also opens up new directions for autonomous driving through multimodal data processing and decision alignment.
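
The behavioral planning state alignment described above can be illustrated with a minimal Python sketch. It shows one way a language-model reply could be parsed into a fixed set of decision states that a downstream motion planner might consume. Everything specific here is an assumption made for illustration and is not taken from the paper or its code: the state names in `SpeedState` and `PathState`, the `speed=...; path=...; reason=...` reply format, and the `parse_decision` helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpeedState(Enum):
    # Hypothetical speed decision states (illustrative, not the paper's list).
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"


class PathState(Enum):
    # Hypothetical path decision states (illustrative, not the paper's list).
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"


@dataclass(frozen=True)
class Decision:
    speed: SpeedState
    path: PathState
    explanation: str


def parse_decision(text: str) -> Optional[Decision]:
    """Parse an assumed reply format: 'speed=<STATE>; path=<STATE>; reason=<text>'.

    Returns None when a required field is missing or names an unknown state,
    so the caller can fall back to a safe default instead of acting on it.
    """
    fields = {}
    # Split into at most three parts so the free-text reason may contain ';'.
    for part in text.split(";", 2):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    try:
        speed = SpeedState(fields["speed"].upper())
        path = PathState(fields["path"].upper())
    except (KeyError, ValueError):
        return None
    return Decision(speed, path, fields.get("reason", ""))
```

Restricting the output to an enumerated set is the point of the sketch: free-form text cannot be handed to a planner, whereas a validated pair of states can, and the explanation travels alongside the decision without affecting control.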