Abstract:The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins's reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process. Then we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.
What problem does this paper attempt to address?
The problem this paper attempts to address is the limitations of current Autonomous Driving Systems (ADS) in handling complex real-world driving scenarios. Specifically, these issues include:
1. **Overall Understanding and Interpretation**: Existing data-driven autonomous driving systems often fail to comprehensively understand dynamic and complex driving scenarios, especially in long-tail distribution scenarios in open-world driving environments. For example, when a ball rolls onto the road and a child runs after it to retrieve the ball, a human driver can immediately assess the potential danger and take appropriate action, whereas current autonomous driving systems may require a large amount of similar data to accurately interpret such a scenario.
2. **Instant Learning and Adaptation**: Unlike human drivers, existing autonomous driving systems require a large amount of training data to handle new situations. For instance, a human driver can quickly learn how to navigate around a new road obstacle after encountering it once or twice, while an autonomous driving system may need multiple similar scenarios to learn the same lesson.
3. **Reflection and Error Recovery**: Current autonomous driving systems typically employ feedforward processing during operation, lacking real-time correction capabilities based on feedback and guidance. In contrast, human drivers can adjust their driving behavior in real-time based on feedback. For example, if a human driver takes a wrong turn, they can quickly adjust their decision based on the error feedback, whereas an autonomous driving system may struggle to recover quickly from mistakes.
To overcome these limitations, the paper proposes a novel Vision-Language Model (VLM) named Dolphins, designed as a conversational driving assistant to bridge the gap between existing autonomous driving systems and human driving. Dolphins generate corresponding outputs through multimodal inputs (including video, text instructions, and historical control signals) and have the following features:
- **Comprehensive Understanding of Complex Driving Scenarios**: Dolphins can handle complex open-world driving scenarios and address a range of autonomous driving tasks.
- **Human-Level Capabilities**: Dolphins demonstrate human-level capabilities such as instant learning, adaptation, reflection, and reasoning, significantly enhancing the performance of autonomous driving systems.
With these features, Dolphins excel in multiple tasks such as perception, prediction, and planning, effectively handling both static and dynamic scenarios, integrating environmental factors, and efficiently completing downstream prediction and planning tasks. Additionally, Dolphins possess contextual learning abilities, quickly adapting to new driving conditions and improving model accuracy and reliability through error recovery mechanisms.