Abstract: A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.


### What problems does this paper attempt to solve?

This paper addresses the problem of understanding the complex, long-tail scenarios that autonomous driving faces in urban environments. Specifically, the authors propose a new system, **DriveVLM**, which uses Vision-Language Models (VLMs) to improve scene understanding and planning.

#### Main challenges:

1. **Understanding complex and long-tail scenarios**: In urban environments, autonomous driving encounters many complex and rare scenarios, such as poor road conditions and subtle human behaviors. Traditional methods struggle to understand and handle these scenarios accurately.
2. **Limitations of existing systems**: Existing autonomous driving systems usually consist of 3D perception, motion prediction, and planning modules, but they perform poorly in complex and unpredictable scenarios. For example, 3D perception is limited to detecting familiar objects and overlooks rare objects and their unique properties; motion prediction and planning focus mainly on trajectory-level actions and overlook decision-level interactions between objects and vehicles.

#### Solutions:

To address these problems, the authors propose the following:

1. **DriveVLM**: A VLM-based autonomous driving system that improves understanding and planning in complex scenarios through three key modules: scene description, scene analysis, and hierarchical planning.
   - **Scene description module**: Describes the driving environment in language and identifies the key objects in the scene.
   - **Scene analysis module**: Analyzes in depth the characteristics of the key objects and their impact on the ego vehicle.
   - **Hierarchical planning module**: Formulates plans step by step, from high-level actions down to specific waypoints.
2. **DriveVLM-Dual**: To address the limitations of VLMs in spatial reasoning and computational cost, the authors further propose a hybrid system, DriveVLM-Dual. It combines the strengths of DriveVLM with the traditional autonomous driving pipeline to achieve more accurate spatial reasoning and real-time planning.
3. **New dataset and evaluation metrics**: To better train and evaluate these models, the authors construct a new scene understanding and planning dataset (SUP-AD) and propose new evaluation metrics that measure a model's ability in scene analysis and meta-action planning.

#### Experimental results:

Experiments show that DriveVLM and DriveVLM-Dual perform well on both the nuScenes dataset and the authors' own SUP-AD dataset, especially in complex and unpredictable driving conditions. In addition, DriveVLM-Dual has been deployed on a production vehicle, verifying its effectiveness in real-world autonomous driving environments.

In conclusion, by introducing VLMs, this paper improves the ability of autonomous driving systems to understand and plan for complex scenarios, and it addresses the shortcomings of existing systems in handling complex and long-tail scenarios.
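The three-module flow summarized above (scene description, then scene analysis, then hierarchical planning from high-level actions down to waypoints) can be illustrated with a small sketch. This is not the authors' implementation: every class, function, and parameter name below is hypothetical, and each VLM call is replaced by a simple rule-based stub so that only the data flow between the stages is shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed convention for this sketch: (x, y) in metres, x pointing forward.
Waypoint = Tuple[float, float]


@dataclass
class SceneDescription:
    environment: str        # language description of the driving environment
    key_objects: List[str]  # names of the key objects in the scene


@dataclass
class ObjectAnalysis:
    name: str
    influence: str          # how the object may affect the ego vehicle


@dataclass
class Plan:
    meta_actions: List[str]   # coarse, high-level actions
    waypoints: List[Waypoint] # fine-grained trajectory points


def describe_scene(frame: Dict) -> SceneDescription:
    """Hypothetical stub for the scene description module."""
    return SceneDescription(frame["environment"], list(frame["objects"]))


def analyze_scene(desc: SceneDescription) -> List[ObjectAnalysis]:
    """Hypothetical stub for the scene analysis module."""
    return [
        ObjectAnalysis(name, f"{name} may require the ego vehicle to slow down")
        for name in desc.key_objects
    ]


def plan_hierarchically(
    analyses: List[ObjectAnalysis],
    speed_mps: float = 5.0,
    steps: int = 3,
    dt_s: float = 1.0,
) -> Plan:
    """Hypothetical stub: choose meta-actions first, then derive waypoints."""
    if analyses:
        # Any key object triggers a cautious plan in this toy rule.
        meta_actions = ["slow down", "yield"]
        speed_mps = speed_mps / 2
    else:
        meta_actions = ["keep lane"]
    # Straight-line waypoints at a constant speed, one per time step.
    waypoints = [(speed_mps * dt_s * (i + 1), 0.0) for i in range(steps)]
    return Plan(meta_actions, waypoints)


def run_pipeline(frame: Dict) -> Plan:
    """Chain the three stages in the order described in the summary."""
    desc = describe_scene(frame)
    analyses = analyze_scene(desc)
    return plan_hierarchically(analyses)


if __name__ == "__main__":
    frame = {"environment": "rainy urban street", "objects": ["fallen tree"]}
    print(run_pipeline(frame))
```

In the actual system each stub would presumably be a VLM query; the sketch only makes explicit that planning consumes the analysis output and proceeds from coarse actions to concrete waypoints.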

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

DriveLM: Driving with Graph Visual Question Answering

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

VLP: Vision Language Planning for Autonomous Driving

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

A Survey on Multimodal Large Language Models for Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

HE-Drive: Human-Like End-to-End Driving with Vision Language Models

Vision Language Models in Autonomous Driving: A Survey and Outlook

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

Embodied Understanding of Driving Scenarios

LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Unsupervised Real-to-Virtual Domain Unification for End-to-End Highway Driving

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases