RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd
2024-05-29
Abstract: We need to trust robots that use often opaque AI methods. They need to explain themselves to us, and we need to trust their explanation. In this regard, explainability plays a critical role in trustworthy autonomous decision-making to foster transparency and acceptance among end users, especially in complex autonomous driving. Recent advancements in Multi-Modal Large Language models (MLLMs) have shown promising potential in enhancing the explainability as a driving agent by producing control predictions along with natural language explanations. However, severe data scarcity due to expensive annotation costs and significant domain gaps between different datasets makes the development of a robust and generalisable system an extremely challenging task. Moreover, the prohibitively expensive training requirements of MLLM and the unsolved problem of catastrophic forgetting further limit their generalisability post-deployment. To address these challenges, we present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving. By grounding in retrieved expert demonstration, we empirically validate that RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal prediction. More importantly, it exhibits exceptional zero-shot generalisation capabilities to unseen environments without further training endeavours.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to make autonomous driving models both more explainable and more generalisable, especially in driving environments they have not seen before. It focuses on two points:

1. **Explainability**:
   - Autonomous driving systems are usually treated as "black boxes" whose decision-making is hard to understand. To earn users' trust, a system must be able to explain its own behaviour and justify it.
   - Traditional explanation methods, such as attention visualisation and intermediate tasks (semantic segmentation, object detection, and so on), help decode the decision-making process, but they are not intuitive enough for ordinary users and do not build trust effectively.
2. **Generalisation**:
   - Existing multi-modal large language models (MLLMs) perform poorly in new environments, mainly because of data scarcity, large domain gaps between datasets, high training costs, and catastrophic forgetting.
   - Training a model that performs well across varied environments is very challenging, especially without additional labelled data.

To address these problems, the paper proposes **RAG-Driver**, a retrieval-augmented multi-modal large language model. By introducing Retrieval-Augmented In-Context Learning (RA-ICL), the model improves both its explanations and its generalisation in unseen driving environments.

### Main contributions

1. **Proposing a retrieval-augmented in-context learning method**: Similar driving scenarios are retrieved from a memory bank and supplied as context, which strengthens the model's predictions and explanations.
2. **Achieving state-of-the-art introspective driving explanation performance on the standard benchmark BDD-X**: The model performs strongly at explaining driving actions and justifying them.
3. **Demonstrating strong zero-shot generalisation**: In unseen driving environments, it generates high-quality explanation text and control signal predictions without retraining.

Through these improvements, RAG-Driver makes the autonomous driving system more transparent and credible, and it adapts well to complex and changing driving environments.
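The retrieval idea in the first contribution (fetch similar scenarios from a memory bank, then present them as in-context examples) can be illustrated with a small sketch. This is not the authors' implementation: the `Demo` schema, the hand-written three-dimensional embeddings, the use of cosine similarity, and the prompt layout are all assumptions made for illustration. In the paper's setting the embeddings would come from a learned encoder over driving video and control signals.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """One memory-bank entry (hypothetical schema, not from the paper)."""
    embedding: list[float]  # stand-in for a learned scenario embedding
    action: str             # what the vehicle did
    justification: str      # why it did it


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, bank, k=2):
    """Return the k demos most similar to the query, best match first."""
    ranked = sorted(bank, key=lambda d: cosine(query, d.embedding), reverse=True)
    return ranked[:k]


def build_prompt(demos, question):
    """Place retrieved demos before the current query as in-context examples."""
    parts = [
        f"Example {i}:\nAction: {d.action}\nJustification: {d.justification}"
        for i, d in enumerate(demos, 1)
    ]
    parts.append(f"Current scenario:\n{question}")
    return "\n\n".join(parts)


# Toy memory bank with made-up embeddings (assumption for illustration).
bank = [
    Demo([1.0, 0.0, 0.0], "The car slows down", "because the light ahead is red"),
    Demo([0.0, 1.0, 0.0], "The car turns left", "to follow the road"),
    Demo([0.9, 0.1, 0.0], "The car stops", "because a pedestrian is crossing"),
]
query = [1.0, 0.05, 0.0]  # embedding of the current, unseen scenario

top = retrieve(query, bank, k=2)
prompt = build_prompt(top, "What is the car doing, and why?")
print(prompt)
```

Because the memory bank is consulted at inference time, swapping in demonstrations from a new environment changes the context the model sees without touching its weights, which is the mechanism behind the zero-shot claim above.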