Abstract: We need to trust robots that use often opaque AI methods. They need to explain themselves to us, and we need to trust their explanation. In this regard, explainability plays a critical role in trustworthy autonomous decision-making to foster transparency and acceptance among end users, especially in complex autonomous driving. Recent advancements in Multi-Modal Large Language Models (MLLMs) have shown promising potential in enhancing the explainability of a driving agent by producing control predictions along with natural language explanations. However, severe data scarcity due to expensive annotation costs and significant domain gaps between different datasets make the development of a robust and generalisable system an extremely challenging task. Moreover, the prohibitively expensive training requirements of MLLMs and the unsolved problem of catastrophic forgetting further limit their generalisability post-deployment. To address these challenges, we present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving. By grounding in retrieved expert demonstrations, we empirically validate that RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal prediction. More importantly, it exhibits exceptional zero-shot generalisation capabilities to unseen environments without further training endeavours.

What problem does this paper attempt to address?

The paper addresses how to improve the explainability and generalisation of autonomous driving models, particularly in unseen driving environments. It focuses on two points:

1. **Explainability**:
   - Autonomous driving systems are usually regarded as "black boxes" whose decision-making processes are difficult to understand. To earn users' trust, a system must be able to explain its own behaviour and justify it.
   - Traditional explanation methods, such as attention visualisation and intermediate tasks (semantic segmentation, object detection, etc.), help decode the decision-making process, but they are not intuitive for ordinary users and do not effectively build trust.
2. **Generalisation**:
   - Existing multi-modal large language models (MLLMs) perform poorly in new environments, mainly because of data scarcity, large domain gaps between datasets, high training costs, and catastrophic forgetting.
   - Training a model that performs well across varied environments is very challenging, especially without additional labelled data.

To solve these problems, the paper proposes a retrieval-augmented multi-modal large language model named **RAG-Driver**. By introducing Retrieval-Augmented In-Context Learning (RA-ICL), the model improves both explanation quality and generalisation in unseen driving environments.

### Main contributions

1. **A retrieval-augmented in-context learning method**: Similar driving scenarios are retrieved from a memory bank and supplied as context, which strengthens the model's prediction and explanation ability.
2. **State-of-the-art introspective driving explanation performance on the standard BDD-X benchmark**: The model performs strongly in explaining driving actions and justifying them.
3. **Zero-shot generalisation**: In unseen driving environments, the model generates high-quality explanations and control signal predictions without retraining.

Through these improvements, RAG-Driver increases the transparency and credibility of the autonomous driving system and adapts well to complex, changing driving environments.
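The retrieval step described in the contributions can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say how similarity is computed or how the prompt is laid out, so cosine similarity over fixed-length scene embeddings, the `Demo` record, the prompt template, and all sample values below are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical memory-bank entry: a scene embedding plus expert annotations."""
    embedding: list
    action: str
    justification: str
    control: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, bank, k=2):
    """Return the k memory-bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda d: cosine(query, d.embedding), reverse=True)
    return ranked[:k]


def build_prompt(demos):
    """Lay out retrieved demonstrations as in-context examples before the current query."""
    parts = [
        f"Example {i}:\nAction: {d.action}\n"
        f"Justification: {d.justification}\nControl: {d.control}"
        for i, d in enumerate(demos, 1)
    ]
    parts.append("Current scene:\nAction:")
    return "\n\n".join(parts)


# Toy memory bank with made-up embeddings and annotations (illustrative values only).
bank = [
    Demo([1.0, 0.0, 0.0], "The car slows down", "because the light ahead is red", "speed 2.0, turn 0.0"),
    Demo([0.0, 1.0, 0.0], "The car turns left", "to follow the road", "speed 5.0, turn -15.0"),
    Demo([0.9, 0.1, 0.0], "The car stops", "because a pedestrian is crossing", "speed 0.0, turn 0.0"),
]

query = [1.0, 0.05, 0.0]  # embedding of the current scene (made up)
top = retrieve(query, bank, k=2)
prompt = build_prompt(top)
print(prompt)
```

In this sketch the prompt would then be passed, together with the current video input, to the multi-modal model. Because generalisation comes from swapping the memory bank rather than updating weights, a new environment only needs new entries in `bank`.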

RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

A Language Agent for Autonomous Driving

NLE-DM: Natural-Language Explanations for Decision Making of Autonomous Driving Based on Semantic Scene Understanding

RAG-based Explainable Prediction of Road Users Behaviors for Automated Driving using Knowledge Graphs and Large Language Models

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles

Attention-Based Interrelation Modeling for Explainable Automated Driving

DRIVE: Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers' Driving-thinking Data

World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

ADriver-I: A General World Model for Autonomous Driving

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Textual Explanations for Automated Commentary Driving

AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong Learning

Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM