Abstract: Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.

What problem does this paper attempt to address?

This paper addresses the limitations of existing autonomous driving agents based on foundation models (FMs) when they face complex real-world driving scenarios. Specifically, these problems include:

1. **Long-horizon navigation tasks**: Existing FM agents are mainly trained on simple action-level natural language instructions. They perform poorly on goal-level instructions that require path planning and map knowledge, and cannot cope with long-horizon navigation tasks.
2. **Multi-turn dialogue interaction**: Most current systems focus only on individual instructions in single-turn dialogues, while real human-machine interaction usually involves free-form multi-turn dialogue, especially when dealing with unexpected situations caused by sensor limitations, environmental dynamics, or task changes.
3. **Environmental perception and understanding**: Existing FM agents have deficiencies in visual understanding and multi-modal data processing, which makes it difficult for them to perceive and understand complex driving environments.
4. **Real-time response ability**: Existing systems suffer from long inference time and imbalanced training data, among other issues, which limits their real-time responsiveness in practical applications.

To address these problems, the paper introduces DriVLMe, an autonomous driving agent based on video-language models. It aims to enhance the agent's capabilities through embodied experiences in a simulated environment and social experiences from real human dialogue, thereby achieving more natural and effective communication between humans and the vehicle and improving navigation in complex environments.

### Specific problem summary:

- **Insufficient ability to handle long-horizon navigation tasks**.
- **Inadequate support for multi-turn dialogue interaction**.
- **Limitations in environmental perception and understanding**.
- **Poor real-time response ability**.

### Solutions:

- **Introduce DriVLMe**: Combine embodied experiences and social experiences to improve the navigation and dialogue capabilities of autonomous driving agents in complex environments.
- **Improve the model architecture**: Include a video tokenizer, a route planning module, and a large-language-model backbone to improve the model's understanding of visual information and dialogue history.
- **Optimize the training process**: Use domain video instruction tuning, social instruction tuning, and embodied experience tuning so that the model adapts better to actual driving scenarios.

Through these improvements, DriVLMe aims to provide more natural and effective communication and navigation in more complex driving environments.
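To make the idea of a route planning module more concrete, here is a minimal sketch of how a planner might turn a goal-level instruction into a route that can be passed to a language model as text. It is purely illustrative and is not taken from the paper: the toy road graph `TOY_MAP`, the function names `shortest_route` and `route_to_text`, and the choice of Dijkstra's algorithm are all assumptions.

```python
import heapq

# Hypothetical toy road graph: node -> {neighbor: road length}.
# Not from the paper; used only to illustrate the idea.
TOY_MAP = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"C": 1.0, "D": 4.0},
    "C": {"D": 1.0},
    "D": {},
}


def shortest_route(graph, start, goal):
    """Return (distance, path) for the shortest route, using Dijkstra's algorithm.

    Returns (inf, []) when the goal cannot be reached from the start.
    """
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        dist, node, path = heapq.heappop(frontier)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(frontier, (dist + length, neighbor, path + [neighbor]))
    return float("inf"), []


def route_to_text(path):
    """Render a planned route as plain text that could be placed in a model prompt."""
    return " -> ".join(path) if path else "no route found"


if __name__ == "__main__":
    distance, path = shortest_route(TOY_MAP, "A", "D")
    print(distance, route_to_text(path))
```

In a design like this, the planner supplies the map knowledge that goal-level instructions require, so the language model only has to reason over a short textual route rather than the full road network.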

DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers' Driving-thinking Data

Embodied Understanding of Driving Scenarios

A Language Agent for Autonomous Driving

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles

SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model

AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong Learning

Personalized Autonomous Driving with Large Language Models: Field Experiments

Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles

DriveLM: Driving with Graph Visual Question Answering

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Probing Multimodal LLMs as World Models for Driving

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models