Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton
2023-10-14
Abstract: Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high-quality control commands collected with an RL agent and question-answer pairs generated by a teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.
Robotics, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address two key issues in the field of autonomous driving:

1. **Interpretability**:
   - Modern autonomous driving systems are often viewed as "black boxes," which makes it particularly challenging to give them out-of-distribution (OOD) reasoning capabilities and to interpret their decision-making process. The paper proposes an approach that improves interpretability by integrating object-level vector modalities with pre-trained large language models (LLMs).
2. **Decision-Making and Action Generation**:
   - Autonomous driving systems need to make reasonable decisions and generate the corresponding driving actions in complex environments. The paper introduces a multimodal LLM architecture that can directly interpret and reason about complex driving situations and generate appropriate driving actions.

### Solutions

To achieve these goals, the paper proposes the following methods and techniques:

1. **Novel Multimodal LLM Architecture**:
   - Integrates object-level vector modalities (such as the positions and speeds of vehicles and pedestrians) with a pre-trained LLM. Through a two-stage process of pre-training followed by fine-tuning, the model learns to understand and handle the complex information in driving scenarios.
2. **Large-Scale Driving Scenario QA Dataset**:
   - Constructs a dataset of 160,000 QA pairs based on 10,000 driving scenarios. Each scenario is paired with high-quality control commands collected with a reinforcement learning (RL) agent, and the QA pairs are generated by a teacher LLM (GPT-3.5).
3. **Structured Language Generator**:
   - Uses a structured language generator (lanGen) to convert vector representations into human-readable language descriptions. These descriptions align the vector modality with the language modality, so that the LLM can better understand driving scenarios.
4. **Novel Evaluation Metrics**:
   - Introduces a new evaluation metric for Driving QA that assesses the model's performance in interpreting driving scenarios, answering questions, and making decisions. The evaluation uses an expert LLM (such as GPT-3.5) for scoring, so that answers are graded in a consistent way.

### Experimental Results

The paper validates the proposed methods through a series of experiments:

1. **Perception and Action Prediction**:
   - The pre-trained LLM-Driver performs significantly better on perception tasks (such as counting vehicles and pedestrians) and action prediction tasks (such as accelerating, braking, and steering) than the same model without pre-training, and it is compared against traditional behavioral cloning.
2. **Driving QA Evaluation**:
   - In open-ended QA tasks on driving scenarios, the pre-trained LLM-Driver achieves higher scores, indicating a stronger ability to interpret driving scenarios and answer related questions.

### Conclusion

By integrating object-level vector modalities with a pre-trained LLM, this paper proposes a multimodal architecture that improves the interpretability and decision-making capabilities of autonomous driving systems. By constructing a large-scale driving scenario QA dataset and introducing a new evaluation metric, the paper provides baselines and references for further research in autonomous driving.
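
### Illustrative Sketch: Vector Captioning

The structured language generator described under Solutions turns object-level vectors into text that can be used to align the vector modality with the LLM. The following is a minimal sketch of that idea, not the paper's lanGen implementation. The `Vehicle` and `Pedestrian` classes, their fields, the 0.1 m/s stationary threshold, and the sentence templates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vehicle:
    # Hypothetical fields; the paper's actual vector layout is not reproduced here.
    distance_m: float  # distance from the ego vehicle, in metres
    speed_mps: float   # speed, in metres per second


@dataclass
class Pedestrian:
    distance_m: float  # distance from the ego vehicle, in metres
    crossing: bool     # whether the pedestrian is crossing the road


def _count(n: int, noun: str) -> str:
    """Format a count with a naive English plural, e.g. '1 vehicle', '2 vehicles'."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def caption_scene(vehicles: List[Vehicle], pedestrians: List[Pedestrian]) -> str:
    """Convert object-level vectors into a deterministic, template-based caption."""
    lines = [
        f"I observe {_count(len(vehicles), 'vehicle')} "
        f"and {_count(len(pedestrians), 'pedestrian')}."
    ]
    # Describe the nearest objects first so the caption order is stable.
    for i, v in enumerate(sorted(vehicles, key=lambda v: v.distance_m), start=1):
        # Assumed threshold: anything slower than 0.1 m/s is called stationary.
        state = "stationary" if v.speed_mps < 0.1 else f"moving at {v.speed_mps:.1f} m/s"
        lines.append(f"Vehicle {i} is {v.distance_m:.1f} m away and {state}.")
    for i, p in enumerate(sorted(pedestrians, key=lambda p: p.distance_m), start=1):
        action = "crossing the road" if p.crossing else "not crossing"
        lines.append(f"Pedestrian {i} is {p.distance_m:.1f} m away and {action}.")
    return " ".join(lines)


if __name__ == "__main__":
    print(caption_scene(
        [Vehicle(12.0, 0.0), Vehicle(5.5, 3.2)],
        [Pedestrian(8.0, True)],
    ))
```

Because the caption is generated from templates rather than by a model, every sentence can be traced back to a specific vector field, which is what makes such captions usable as supervision when aligning numeric inputs with language.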