DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao
2024-06-26
Abstract: A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty autonomous driving systems have in understanding complex and long-tailed scenarios in urban environments. The authors propose a new system, **DriveVLM**, which uses Vision-Language Models (VLMs) to improve scene understanding and planning.

#### Main challenges:

1. **Understanding complex and long-tailed scenarios**: In urban environments, autonomous driving encounters many complex and rare situations, such as poor road conditions and subtle human behaviors. Traditional methods struggle to interpret and handle these situations accurately.
2. **Limitations of existing systems**: Existing autonomous driving systems usually consist of 3D perception, motion prediction, and planning modules, and they perform poorly in complex and unpredictable scenarios. For example, 3D perception detects only familiar object categories and misses rare objects and their unique properties, while motion prediction and planning focus on trajectory-level actions and ignore decision-level interactions between objects and the vehicle.

#### Solutions:

To address these problems, the authors propose the following:

1. **DriveVLM**: A VLM-based autonomous driving system that improves understanding and planning in complex scenarios through three key modules: scene description, scene analysis, and hierarchical planning.
   - **Scene description module**: Describes the driving environment in language and identifies the key objects in the scene.
   - **Scene analysis module**: Analyzes in depth the characteristics of the key objects and their influence on the ego vehicle.
   - **Hierarchical planning module**: Formulates plans step by step, from high-level actions down to specific waypoints.
2. **DriveVLM-Dual**: To address the limitations of VLMs in spatial reasoning and their heavy computational requirements, the authors further propose a hybrid system, DriveVLM-Dual. It combines the strengths of DriveVLM with the traditional autonomous driving pipeline to achieve more accurate spatial reasoning and real-time planning.
3. **New dataset and evaluation metrics**: To train and evaluate these models, the authors construct a new scene understanding and planning dataset (SUP-AD) and propose new metrics that measure a model's ability in scene analysis and meta-action planning.

#### Experimental results:

The paper reports that DriveVLM and DriveVLM-Dual perform well on both the nuScenes dataset and the authors' own SUP-AD dataset, especially in complex and unpredictable driving conditions. In addition, DriveVLM-Dual was deployed on a production vehicle, which the authors present as verification of its effectiveness in real-world autonomous driving.

In summary, the paper applies VLMs to improve how autonomous driving systems understand and plan for complex scenarios, targeting the weaknesses of existing systems in complex and long-tailed cases.
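
The three-stage flow described under Solutions (scene description, then scene analysis, then hierarchical planning) can be illustrated with a toy sketch. This is not the authors' implementation: in DriveVLM each stage is carried out by a VLM, whereas the code below uses hand-written placeholder rules. All names (`CriticalObject`, `DrivingPlan`, `BLOCKING_LABELS`, `run_pipeline`) and all waypoint values are hypothetical and exist only to show how data might pass between stages.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y) in metres in the ego-vehicle frame; an assumed convention for this sketch.
Waypoint = Tuple[float, float]

# Hypothetical set of object labels treated as blocking the ego lane.
BLOCKING_LABELS = {"fallen tree", "road debris"}


@dataclass
class CriticalObject:
    label: str
    influence: str = ""  # filled in by the scene-analysis stage


@dataclass
class DrivingPlan:
    meta_actions: List[str]    # coarse, decision-level actions
    waypoints: List[Waypoint]  # trajectory-level output


def describe_scene(raw_labels: List[str]) -> List[CriticalObject]:
    """Stage 1 (placeholder): wrap each reported label as a critical object."""
    return [CriticalObject(label=label) for label in raw_labels]


def analyze_scene(objects: List[CriticalObject]) -> List[CriticalObject]:
    """Stage 2 (placeholder): annotate each object with its influence on the ego vehicle."""
    for obj in objects:
        if obj.label in BLOCKING_LABELS:
            obj.influence = "blocks ego lane"
        else:
            obj.influence = "no direct influence"
    return objects


def plan_hierarchically(objects: List[CriticalObject]) -> DrivingPlan:
    """Stage 3 (placeholder): pick meta-actions first, then derive waypoints from them."""
    blocked = any(obj.influence == "blocks ego lane" for obj in objects)
    if blocked:
        # Shrinking spacing between waypoints stands in for deceleration.
        return DrivingPlan(["slow down", "stop"], [(2.0, 0.0), (3.0, 0.0), (3.5, 0.0)])
    return DrivingPlan(["keep lane"], [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0)])


def run_pipeline(raw_labels: List[str]) -> DrivingPlan:
    """Chain the three stages in the order the summary describes."""
    return plan_hierarchically(analyze_scene(describe_scene(raw_labels)))


if __name__ == "__main__":
    print(run_pipeline(["fallen tree"]))
    print(run_pipeline(["parked car"]))
```

The point of the sketch is the ordering: planning consumes the output of analysis rather than raw detections, and waypoints are derived only after a coarse decision has been made.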