DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, Joyce Chai
2024-10-15
Abstract: Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the limitations of existing autonomous driving agents based on foundation models (FMs) when they face complex real-world driving scenarios. Specifically:

1. **Long-horizon navigation tasks**: Existing FM agents are trained mainly on simple action-level natural language instructions. They perform poorly on goal-level instructions that require path planning and map knowledge, and so cannot cope with long-horizon navigation.
2. **Multi-turn dialogue interaction**: Most current systems handle only individual instructions in single-turn dialogues, whereas real human-machine interaction usually involves free-form multi-turn dialogue, especially when dealing with unexpected situations caused by sensor limitations, environmental dynamics, or task changes.
3. **Environmental perception and understanding**: Existing FM agents are weak at visual understanding and multi-modal data processing, which makes it hard for them to perceive and interpret complex driving environments.
4. **Real-time response ability**: Existing systems suffer from long inference times and imbalanced training data, among other issues, which limits their real-time responsiveness in practical applications.

To address these problems, the paper introduces DriVLMe, an autonomous driving agent based on a video-language model. DriVLMe is developed from embodied experiences in a simulated environment and social experiences from real human dialogue, with the aim of enabling more natural and effective communication between humans and the vehicle and improving navigation in complex environments.

### Specific problem summary:

- **Insufficient ability to handle long-horizon navigation tasks**.
- **Inadequate support for multi-turn dialogue interaction**.
- **Limitations in environmental perception and understanding**.
- **Poor real-time response ability**.

### Solutions:

- **Introduce DriVLMe**: Combine embodied and social experiences to improve the navigation and dialogue capabilities of autonomous driving agents in complex environments.
- **Improve the model architecture**: Include a video tokenizer, a route planning module, and a large-language-model backbone to improve the model's understanding of visual information and dialogue history.
- **Optimize the training process**: Use domain video instruction tuning, social instruction tuning, and embodied experience tuning so that the model adapts better to actual driving scenarios.

With these improvements, DriVLMe is intended to provide more natural and effective communication and navigation in more complex driving environments.
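
As a rough illustration of how a route planning module could feed an LLM backbone alongside video tokens and dialogue history, the sketch below uses Dijkstra's algorithm on a toy road graph and assembles a text prompt. This is not the paper's implementation: the graph format, the `plan_route` and `build_prompt` functions, the `<video>` placeholder, and the prompt layout are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical road graph: node name -> list of (neighbor, distance in meters).
RoadGraph = Dict[str, List[Tuple[str, float]]]


def plan_route(graph: RoadGraph, start: str, goal: str) -> List[str]:
    """Return the shortest node sequence from start to goal, or [] if unreachable.

    Uses Dijkstra's algorithm as a stand-in for the route planning module.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, dist in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + dist, neighbor, path + [neighbor]))
    return []


def build_prompt(route: List[str], dialogue: List[str], n_video_tokens: int) -> str:
    """Assemble a text prompt for the LLM backbone (hypothetical layout).

    Video tokens are represented only by placeholder strings here; a real
    video tokenizer would supply embeddings instead.
    """
    video_part = " ".join("<video>" for _ in range(n_video_tokens))
    route_part = " -> ".join(route) if route else "no route found"
    history = "\n".join(dialogue)
    return (
        f"{video_part}\n"
        f"Planned route: {route_part}\n"
        f"Dialogue:\n{history}\n"
        f"Next action:"
    )
```

For example, calling `plan_route` on a small graph and passing the result to `build_prompt` together with the passenger's utterances yields a single string that a language model could complete with the next driving action or a reply. Keeping the planner outside the model is one plausible way to supply map knowledge for goal-level instructions.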