Abstract: Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high-quality control commands collected with an RL agent and question-answer pairs generated by a teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address two key issues in the field of autonomous driving:

1. **Interpretability**:
   - Modern autonomous driving systems are often viewed as "black boxes," making it particularly challenging to endow them with out-of-distribution (OOD) reasoning capabilities and interpretability during the decision-making process. The paper proposes a novel approach that enhances system interpretability by integrating object-level vector modalities with pre-trained language models (LLMs).
2. **Decision-Making and Action Generation**:
   - Autonomous driving systems need to make reasonable decisions and generate corresponding driving actions in complex driving environments. The paper introduces a new multimodal LLM architecture that can directly interpret and reason about complex driving situations and generate appropriate driving actions.

### Solutions

To achieve the above goals, the paper proposes the following innovative methods and techniques:

1. **Novel Multimodal LLM Architecture**:
   - Integrates object-level vector modalities (such as the positions and speeds of vehicles and pedestrians) with pre-trained LLMs. Through a two-stage pre-training and fine-tuning process, the model can understand and handle complex information in driving scenarios.
2. **Large-Scale Driving Scenario QA Dataset**:
   - Constructs a dataset containing 160,000 QA pairs based on 10,000 driving scenarios. These QA pairs consist of high-quality control commands generated by reinforcement learning (RL) agents and QA pairs generated by teacher LLMs (such as GPT-3.5).
3. **Structured Language Generator**:
   - Uses a structured language generator (lanGen) to convert vector representations into human-readable language descriptions, aligning vector modalities with language modalities, enabling LLMs to better understand driving scenarios.
4. **Novel Evaluation Metrics**:
   - Introduces a new evaluation metric (Driving QA) to assess the model's performance in interpreting driving scenarios, answering questions, and making decisions. The evaluation uses expert LLMs (such as GPT-3.5) for scoring, ensuring consistency and accuracy in the assessment.

### Experimental Results

The paper validates the effectiveness of the proposed methods through a series of experiments:

1. **Perception and Action Prediction**:
   - The pre-trained LLM-Driver demonstrates significantly better performance in perception tasks (such as detecting the number of vehicles and pedestrians) and action prediction tasks (such as accelerating, braking, and steering) compared to non-pre-trained models and traditional behavioral cloning methods.
2. **Driving QA Evaluation**:
   - In open-ended QA tasks for driving scenarios, the pre-trained LLM-Driver achieves higher scores, indicating its stronger ability to interpret driving scenarios and answer related questions.

### Conclusion

By integrating object-level vector modalities with pre-trained LLMs, this paper proposes a new multimodal architecture that effectively enhances the interpretability and decision-making capabilities of autonomous driving systems. Through the construction of a large-scale driving scenario QA dataset and the introduction of new evaluation metrics, the paper provides important baselines and references for research in the field of autonomous driving.
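
### Illustrative Sketch: Vector Captioning

The structured language generator described under Solutions turns object-level vectors into text. The snippet below is only a minimal sketch of that idea, not the paper's lanGen implementation: the `SceneObject` fields, the ego-centred coordinate convention, the object labels, and the sentence templates are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import hypot


@dataclass
class SceneObject:
    """Hypothetical object-level vector; field names are illustrative only."""
    kind: str     # assumed label, e.g. "vehicle" or "pedestrian"
    x: float      # assumed longitudinal offset from ego in metres (positive = ahead)
    y: float      # assumed lateral offset from ego in metres (positive = left)
    speed: float  # assumed speed in m/s


def describe_object(obj: SceneObject) -> str:
    """Render one object vector as a templated English sentence."""
    distance = hypot(obj.x, obj.y)
    longitudinal = "ahead of" if obj.x >= 0 else "behind"
    if obj.y > 0:
        lateral = "to the left"
    elif obj.y < 0:
        lateral = "to the right"
    else:
        lateral = "laterally aligned with it"
    return (
        f"A {obj.kind} is {distance:.1f} m away, {longitudinal} the ego vehicle "
        f"and {lateral}, moving at {obj.speed:.1f} m/s."
    )


def caption_scene(objects: list[SceneObject]) -> str:
    """Build a caption: a per-kind count line, then one sentence per object,
    ordered from nearest to farthest."""
    if not objects:
        return "No objects are observed."
    counts = Counter(obj.kind for obj in objects)
    summary = ", ".join(f"{n} {kind}(s)" for kind, n in sorted(counts.items()))
    nearest_first = sorted(objects, key=lambda o: hypot(o.x, o.y))
    lines = [f"Observed: {summary}."]
    lines.extend(describe_object(obj) for obj in nearest_first)
    return "\n".join(lines)


if __name__ == "__main__":
    scene = [
        SceneObject("vehicle", 3.0, 4.0, 5.0),
        SceneObject("pedestrian", -6.0, -8.0, 1.2),
    ]
    print(caption_scene(scene))
```

Captions of this kind, paired with the vectors they were generated from, are the sort of vector captioning data the abstract mentions for aligning the numeric modality with the LLM. A rule-based template keeps that text deterministic and checkable against the input vectors.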

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Probing Multimodal LLMs as World Models for Driving

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

DriveLM: Driving with Graph Visual Question Answering

Personalized Autonomous Driving with Large Language Models: Field Experiments

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

A Survey on Multimodal Large Language Models for Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles

A Language Agent for Autonomous Driving

Human-Centric Autonomous Systems with LLMs for User Command Reasoning