Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Yiqun Duan, Qiang Zhang, Renjing Xu
2024-07-29
Abstract: The utilization of Large Language Models (LLMs) within the realm of reinforcement learning, particularly as planners, has garnered a significant degree of attention in recent scholarly literature. However, a substantial proportion of existing research predominantly focuses on planning models for robotics that transmute the outputs derived from perception models into linguistic forms, thus adopting a 'pure-language' strategy. In this research, we propose a hybrid End-to-End learning framework for autonomous driving by combining basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from the separated train model into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, thereby intrinsically alleviating description bias by separated pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting of letting LLMs help the driving model correct mistakes and complicated scenarios. The results of our experiments suggest that the proposed methodology can attain driving scores of 49.21%, coupled with an impressive route completion rate of 91.34% in the offline evaluation conducted via CARLA. These performance metrics are comparable to the most advanced driving models.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address several key issues in end-to-end autonomous driving:

1. **Multimodal Perception Fusion**:
   - Existing research mainly focuses on converting the output of perception models into pure language input, which may lead to descriptive bias. This paper proposes a new framework that mitigates this bias intrinsically by fusing visual and LiDAR sensor inputs into learnable multimodal tokens.
2. **Combining End-to-End Learning with Language Models**:
   - Directly allowing large language models (LLMs) to drive poses risks and uncertainties. This paper explores a hybrid setup where LLMs assist the driving model in correcting errors and handling complex scenarios rather than directly controlling the vehicle.
3. **Improving Driving Performance**:
   - By introducing multimodal perception fusion and the combination with language models, this paper aims to enhance the performance of autonomous driving systems, especially in complex and long-tail scenarios. Experimental results show that this method achieves high driving scores (49.21%) and route completion rates (91.34%) in the CARLA simulator, comparable to state-of-the-art driving models.

### Main Contributions

1. **Multimodal Perception Fusion**:
   - Proposes a method to fuse visual and LiDAR features into joint feature token representations, enhancing end-to-end autonomous driving through multimodal prompts.
2. **Simple Prompt Construction Method**:
   - Designs a simple and effective prompt construction method that continuously integrates unified observations, current states, trajectories, and control actions into the prompts.
3. **Reward-Based Reinforcement Learning Supervision**:
   - Introduces reward-based reinforcement learning supervision through language prompts in autonomous driving scenarios to further improve model performance.

### Method Overview

1. **Multimodal Joint Token Encoder**:
   - Uses two different CNN branches to extract shallow features from image and LiDAR inputs, then fuses geometric features through a cross-modal self-attention mechanism. Finally, maps the features into a unified semantic token space via a Swin Transformer encoder.
2. **Prompt Construction**:
   - Combines multimodal tokens, the vehicle's current state, and driving task commands sequentially into prompts to guide the language model in predicting perception results and driving actions. Three task modes are designed: directly generating perception observations and driving actions, re-querying LLMs to resolve conflicts with the safety controller, and correcting driving errors based on multimodal tokens and driving outputs.
3. **Re-query Mechanism**:
   - Due to the uncertainty of autoregressive models, when there are conflicts between predicted waypoints and control actions, the system triggers a re-query mechanism, allowing the language model to "think twice."
4. **Reward-Based Tuning**:
   - Introduces reinforcement learning supervision to autoregressive learning through the Proximal Policy Optimization (PPO) algorithm and a masking mechanism to improve the model's prediction accuracy.

### Experimental Results

- **Experimental Environment**: Conducted end-to-end driving task experiments using CARLA simulator 0.9.14, covering various weather and lighting conditions.
- **Data Collection**: Collected 1000 routes in 8 official town maps, with an average length of 400 meters.
- **Evaluation Benchmark**: Used the LongSet6 benchmark for offline evaluation, with main metrics including Route Completion (RC), Infraction Score (IS), and Driving Score (DS).
- **Performance Comparison**: Compared to existing SOTA methods, the proposed method shows excellent performance in route completion rate and driving score, approaching expert levels.
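To make the three metrics concrete, here is a minimal Python sketch of how RC, IS, and DS are commonly aggregated in CARLA-style evaluation. It assumes the usual convention that a route's driving score is its route completion multiplied by its infraction score, and that the reported figures are averages over routes. The `RouteResult` type, the helper names, and the sample numbers are illustrative assumptions, not values or code from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RouteResult:
    """Outcome of one evaluation route (hypothetical container)."""
    route_completion: float  # percentage of the route completed, in [0, 100]
    infraction_score: float  # multiplicative penalty factor, in [0, 1]


def driving_score(result: RouteResult) -> float:
    """Per-route driving score, assumed to be RC weighted by IS."""
    return result.route_completion * result.infraction_score


def summarize(results):
    """Average each metric over all routes."""
    return {
        "RC": mean(r.route_completion for r in results),
        "IS": mean(r.infraction_score for r in results),
        "DS": mean(driving_score(r) for r in results),
    }


# Made-up example routes, for illustration only.
routes = [
    RouteResult(100.0, 0.5),
    RouteResult(80.0, 1.0),
    RouteResult(60.0, 0.25),
]
summary = summarize(routes)
print(summary)
```

Because DS is averaged per route under this convention, it generally differs from the product of the averaged RC and IS, which is one reason a high route completion rate can coexist with a much lower driving score.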
Through these innovations, this paper provides a new solution for end-to-end autonomous driving, particularly in the application of multimodal perception fusion and language models.
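As a rough illustration of the re-query mechanism from the method overview, the Python sketch below asks a model for waypoints and controls, checks whether they disagree, and asks again a bounded number of times. Everything here is an assumption for illustration: the `Prediction` type, the thresholds in `has_conflict`, the wording of the follow-up prompt, and the stub model are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    """Hypothetical model output: future waypoints plus low-level controls."""
    waypoints: List[Tuple[float, float]]  # (x, y) in metres, ego frame
    throttle: float  # in [0, 1]
    brake: float     # in [0, 1]


def has_conflict(pred: Prediction, min_move: float = 0.5) -> bool:
    """Illustrative rule: waypoints and controls must agree on moving or stopping."""
    distance = math.hypot(*pred.waypoints[-1]) if pred.waypoints else 0.0
    wants_to_move = distance > min_move
    wants_to_stop = pred.brake > 0.5
    wants_to_go = pred.throttle > 0.1 and not wants_to_stop
    return (wants_to_move and wants_to_stop) or (not wants_to_move and wants_to_go)


def predict_with_requery(
    query_model: Callable[[str], Prediction],
    prompt: str,
    max_requeries: int = 2,
) -> Tuple[Prediction, int]:
    """Query once, then re-query while the output is self-contradictory."""
    pred = query_model(prompt)
    attempts = 0
    while has_conflict(pred) and attempts < max_requeries:
        attempts += 1
        hint = "\nYour waypoints and controls disagree; please reconsider."
        pred = query_model(prompt + hint)
    return pred, attempts


def make_stub() -> Callable[[str], Prediction]:
    """Stand-in for the language model: first answer conflicts, second does not."""
    answers = iter([
        Prediction([(0.0, 1.0), (0.0, 3.0)], throttle=0.0, brake=1.0),
        Prediction([(0.0, 1.0), (0.0, 3.0)], throttle=0.6, brake=0.0),
    ])
    return lambda prompt: next(answers)


final, n_requeries = predict_with_requery(
    make_stub(), "<tokens> state: speed=4.2 m/s; command: follow lane"
)
print(n_requeries, final)
```

Bounding the number of re-queries keeps latency predictable; if the conflict persists after the last attempt, a real system would presumably fall back to a safety controller rather than act on the contradictory output.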