Abstract: Large Language Models (LLMs) have attracted considerable attention in recent reinforcement learning literature, particularly as planners. However, most existing research on planning models for robotics converts the outputs of perception models into language, adopting a "pure-language" strategy. In this research, we propose a hybrid end-to-end learning framework for autonomous driving that combines basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from a separately trained model into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, which intrinsically alleviates the description bias introduced by separately pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting in which LLMs help the driving model correct mistakes and handle complicated scenarios. Our experiments suggest that the proposed method attains a driving score of 49.21% and a route completion rate of 91.34% in offline evaluation conducted via CARLA. These results are comparable to those of the most advanced driving models.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key issues in end-to-end autonomous driving:

1. **Multimodal Perception Fusion**:
   - Existing research mainly focuses on converting the output of perception models into pure language input, which may lead to descriptive bias. This paper proposes a new framework that mitigates this bias intrinsically by fusing visual and LiDAR sensor inputs into learnable multimodal tokens.
2. **Combining End-to-End Learning with Language Models**:
   - Directly allowing large language models (LLMs) to drive poses risks and uncertainties. This paper explores a hybrid setup where LLMs assist the driving model in correcting errors and handling complex scenarios rather than directly controlling the vehicle.
3. **Improving Driving Performance**:
   - By introducing multimodal perception fusion and combining it with language models, this paper aims to enhance the performance of autonomous driving systems, especially in complex and long-tail scenarios. Experimental results show that this method achieves a driving score of 49.21% and a route completion rate of 91.34% in the CARLA simulator, comparable to state-of-the-art driving models.

### Main Contributions

1. **Multimodal Perception Fusion**:
   - Proposes a method to fuse visual and LiDAR features into joint feature token representations, enhancing end-to-end autonomous driving through multimodal prompts.
2. **Simple Prompt Construction Method**:
   - Designs a simple and effective prompt construction method that continuously integrates unified observations, current states, trajectories, and control actions into the prompts.
3. **Reward-Based Reinforcement Learning Supervision**:
   - Introduces reward-based reinforcement learning supervision through language prompts in autonomous driving scenarios to further improve model performance.

### Method Overview

1. **Multimodal Joint Token Encoder**:
   - Uses two different CNN branches to extract shallow features from image and LiDAR inputs, then fuses geometric features through a cross-modal self-attention mechanism. Finally, maps the features into a unified semantic token space via a Swin Transformer encoder.
2. **Prompt Construction**:
   - Combines multimodal tokens, the vehicle's current state, and driving task commands sequentially into prompts to guide the language model in predicting perception results and driving actions. Three task modes are designed: directly generating perception observations and driving actions, re-querying LLMs to resolve conflicts with the safety controller, and correcting driving errors based on multimodal tokens and driving outputs.
3. **Re-query Mechanism**:
   - Due to the uncertainty of autoregressive models, when there are conflicts between predicted waypoints and control actions, the system triggers a re-query mechanism, allowing the language model to "think twice."
4. **Reward-Based Tuning**:
   - Introduces reinforcement learning supervision to autoregressive learning through the Proximal Policy Optimization (PPO) algorithm and a masking mechanism to improve the model's prediction accuracy.

### Experimental Results

- **Experimental Environment**: Conducted end-to-end driving task experiments using CARLA simulator 0.9.14, covering various weather and lighting conditions.
- **Data Collection**: Collected 1000 routes in 8 official town maps, with an average length of 400 meters.
- **Evaluation Benchmark**: Used the Longest6 benchmark for offline evaluation, with main metrics including Route Completion (RC), Infraction Score (IS), and Driving Score (DS).
- **Performance Comparison**: Compared to existing SOTA methods, the proposed method shows excellent performance in route completion rate and driving score, approaching expert levels.

Through these innovations, this paper provides a new solution for end-to-end autonomous driving, particularly in the application of multimodal perception fusion and language models.
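
The re-query mechanism from the method overview can be illustrated with a short Python sketch. This is not the paper's implementation: the `Prediction` type, the `query_llm` callable, the conflict test in `is_conflicting`, the thresholds, and the `max_retries` limit are all hypothetical choices made for illustration. The idea shown is only the control flow: query once, check whether the predicted waypoints and the control action disagree, and ask the model again if they do.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    """Hypothetical container for one model output."""
    waypoints: List[Tuple[float, float]]  # future (x, y) points in the ego frame, metres
    throttle: float                       # 0..1
    brake: float                          # 0..1


def is_conflicting(pred: Prediction, min_motion: float = 0.5) -> bool:
    """Assumed conflict test: waypoints and controls disagree about moving.

    A conflict is flagged when the waypoints imply motion while the controls
    imply stopping, or the waypoints imply standing still while the controls
    imply driving. The thresholds are illustrative, not taken from the paper.
    """
    if not pred.waypoints:
        return True
    x, y = pred.waypoints[-1]
    wants_to_move = math.hypot(x, y) > min_motion
    wants_to_stop = pred.brake > 0.5 or pred.throttle < 0.05
    return wants_to_move == wants_to_stop


def drive_with_requery(
    query_llm: Callable[[str], Prediction],
    prompt: str,
    max_retries: int = 2,
) -> Tuple[Prediction, int]:
    """Query the model, re-querying while its outputs are inconsistent.

    Returns the last prediction and the number of re-queries used. If the
    prediction still conflicts after max_retries, the caller would need a
    fallback (for example a safety controller); that part is not shown.
    """
    pred = query_llm(prompt)
    retries = 0
    while is_conflicting(pred) and retries < max_retries:
        retries += 1
        requery_prompt = (
            f"{prompt}\nThe previous waypoints and control action conflict. "
            "Please reconsider."
        )
        pred = query_llm(requery_prompt)
    return pred, retries
```

Bounding the number of retries keeps the worst-case latency predictable, which matters for a controller that has to respond on every simulation step.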
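
To make the relationship between the three reported metrics concrete, here is a minimal Python sketch. It assumes the CARLA leaderboard convention, in which each route's driving score is its route completion multiplied by its infraction score and the reported figure is the mean over routes. The function name and the per-route values are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple


def driving_score(routes: List[Tuple[float, float]]) -> float:
    """Mean per-route driving score.

    Each entry is (route_completion_percent, infraction_score), with the
    infraction score in [0, 1]. Assumes the per-route product convention
    described in the lead-in.
    """
    if not routes:
        raise ValueError("at least one route is required")
    return sum(rc * inf for rc, inf in routes) / len(routes)


# Two made-up routes: one completed with heavy penalties, one partly
# completed without infractions.
example_routes = [(100.0, 0.5), (80.0, 1.0)]
mean_rc = sum(rc for rc, _ in example_routes) / len(example_routes)
ds = driving_score(example_routes)
print(f"RC = {mean_rc:.1f}%, DS = {ds:.1f}%")
```

Because the product is taken per route before averaging, the driving score is generally not equal to the average route completion multiplied by the average infraction score, which is why a high route completion can coexist with a much lower driving score.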

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles

LLM4RL: Enhancing Reinforcement Learning with Large Language Models

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs

Generating and Evolving Reward Functions for Highway Driving with Large Language Models

Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving

Personalized Autonomous Driving with Large Language Models: Field Experiments

Probing Multimodal LLMs as World Models for Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Empowering Autonomous Driving with Large Language Models: A Safety Perspective

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Human-Centric Autonomous Systems with LLMs for User Command Reasoning

Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning