DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao
2024-03-15
Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, fine-tuning on domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V. The code and dataset will be publicly available.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses the problem of building an interpretable end-to-end system for autonomous driving. The researchers propose DriveGPT4, a model that uses a large language model (LLM) to process multimodal input so that it can both explain a vehicle's behavior in natural language and predict low-level control signals. Given a video sequence captured by the front-facing camera, DriveGPT4 predicts the next control signals (such as vehicle speed and steering angle) and can converse with human users to describe what the vehicle is doing and why. The model is trained on a customized visual instruction tuning dataset with a mix-finetuning strategy, which is intended to improve transparency and interpretability without sacrificing performance. Evaluations on the BDD-X dataset show that DriveGPT4 outperforms existing baseline methods on multiple tasks, particularly in complex driving scenarios.
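To make the input/output interface described above more concrete, the following is a minimal sketch of how a video-plus-question prompt might be assembled and how control signals could be written to and read back from plain text. It is not the authors' implementation: the `<frame_i>` placeholders, the `USER:`/`ASSISTANT:` prompt layout, the `Speed: ..., Steering angle: ...` answer format, the units, and every function and class name here are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlSignal:
    """Low-level control signal for the next time step (units are assumed)."""
    speed: float           # assumed to be in m/s
    steering_angle: float  # assumed to be in degrees


def build_prompt(num_frames: int, question: str) -> str:
    """Combine hypothetical per-frame placeholders with a user question."""
    frames = " ".join(f"<frame_{i}>" for i in range(num_frames))
    return f"{frames}\nUSER: {question}\nASSISTANT:"


def format_control(signal: ControlSignal) -> str:
    """Serialize a control signal as text so a language model can emit it."""
    return f"Speed: {signal.speed:.2f}, Steering angle: {signal.steering_angle:.2f}"


# Matches the hypothetical text format produced by format_control.
_CONTROL_PATTERN = re.compile(
    r"Speed:\s*(-?\d+(?:\.\d+)?),\s*Steering angle:\s*(-?\d+(?:\.\d+)?)"
)


def parse_control(text: str) -> Optional[ControlSignal]:
    """Recover a control signal from model output, or None if none is found."""
    match = _CONTROL_PATTERN.search(text)
    if match is None:
        return None
    return ControlSignal(float(match.group(1)), float(match.group(2)))
```

Under these assumptions, `build_prompt(8, "What is the vehicle doing, and why?")` would produce the text fed to the model alongside the encoded frames, and `parse_control` would be applied to the generated answer to extract numeric speed and steering values for downstream evaluation.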