Abstract:The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins's reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process. Then we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.

What problem does this paper attempt to address?

The problem this paper attempts to address is the limitations of current Autonomous Driving Systems (ADS) in handling complex real-world driving scenarios. Specifically, these issues include: 1. **Overall Understanding and Interpretation**: Existing data-driven autonomous driving systems often fail to comprehensively understand dynamic and complex driving scenarios, especially in long-tail distribution scenarios in open-world driving environments. For example, when a ball rolls onto the road and a child runs after it to retrieve the ball, a human driver can immediately assess the potential danger and take appropriate action, whereas current autonomous driving systems may require a large amount of similar data to accurately interpret such a scenario. 2. **Instant Learning and Adaptation**: Unlike human drivers, existing autonomous driving systems require a large amount of training data to handle new situations. For instance, a human driver can quickly learn how to navigate around a new road obstacle after encountering it once or twice, while an autonomous driving system may need multiple similar scenarios to learn the same lesson. 3. **Reflection and Error Recovery**: Current autonomous driving systems typically employ feedforward processing during operation, lacking real-time correction capabilities based on feedback and guidance. In contrast, human drivers can adjust their driving behavior in real-time based on feedback. For example, if a human driver takes a wrong turn, they can quickly adjust their decision based on the error feedback, whereas an autonomous driving system may struggle to recover quickly from mistakes. To overcome these limitations, the paper proposes a novel Vision-Language Model (VLM) named Dolphins, designed as a conversational driving assistant to bridge the gap between existing autonomous driving systems and human driving. Dolphins generate corresponding outputs through multimodal inputs (including video, text instructions, and historical control signals) and have the following features: - **Comprehensive Understanding of Complex Driving Scenarios**: Dolphins can handle complex open-world driving scenarios and address a range of autonomous driving tasks. - **Human-Level Capabilities**: Dolphins demonstrate human-level capabilities such as instant learning, adaptation, reflection, and reasoning, significantly enhancing the performance of autonomous driving systems. With these features, Dolphins excel in multiple tasks such as perception, prediction, and planning, effectively handling both static and dynamic scenarios, integrating environmental factors, and efficiently completing downstream prediction and planning tasks. Additionally, Dolphins possess contextual learning abilities, quickly adapting to new driving conditions and improving model accuracy and reliability through error recovery mechanisms.

Dolphins: Multimodal Language Model for Driving

Dolphin: a task orchestration language for autonomous vehicle networks

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles

Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles

DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

Probing Multimodal LLMs as World Models for Driving

Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making Framework

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

ADriver-I: A General World Model for Autonomous Driving

A Language Agent for Autonomous Driving

World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong Learning

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving