Abstract: In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals from image observations and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5% reduction in collision rate, a 104.6% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL can be seamlessly integrated with almost any standard RL algorithm, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements. The demo video and code can be accessed at: https://zilin-huang.github.io/VLM-RL-website
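
The contrasting language goal idea in the abstract can be illustrated with a minimal sketch. It assumes the semantic reward is the cosine similarity between an image embedding and the positive goal embedding, minus the similarity to the negative goal embedding; the abstract does not give the exact formula, so this form, the function name `clg_reward`, and the toy embeddings are all hypothetical. In practice the embeddings would come from a pre-trained VLM, and a whole batch of frames would be scored in one call, in the spirit of the batch processing mentioned above.

```python
import numpy as np


def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def clg_reward(image_embs, pos_goal_emb, neg_goal_emb):
    """Contrasting-goal reward for a batch of image embeddings.

    Assumed form (not taken from the paper): cosine similarity to the
    positive goal minus cosine similarity to the negative goal.
    """
    imgs = normalize(np.atleast_2d(image_embs))
    pos = normalize(pos_goal_emb)
    neg = normalize(neg_goal_emb)
    # One matrix-vector product per goal scores the whole batch at once.
    return imgs @ pos - imgs @ neg


# Toy 3-d embeddings standing in for VLM outputs (hypothetical values).
pos_goal = np.array([1.0, 0.0, 0.0])
neg_goal = np.array([0.0, 1.0, 0.0])
frames = np.array([
    [2.0, 0.0, 0.0],  # aligned with the positive goal
    [0.0, 3.0, 0.0],  # aligned with the negative goal
    [1.0, 1.0, 0.0],  # equally close to both goals
])

rewards = clg_reward(frames, pos_goal, neg_goal)
print(rewards)
```

Under these assumptions, a frame that matches the positive goal receives the highest reward, a frame that matches the negative goal receives the lowest, and an ambiguous frame scores near zero. The hierarchical synthesis step described in the abstract would then combine such a score with vehicle state information, which this sketch does not attempt.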

VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Vision Language Models in Autonomous Driving: A Survey and Outlook

Empowering Autonomous Driving with Large Language Models: A Safety Perspective

WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

VLP: Vision Language Planning for Autonomous Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

DriveLM: Driving with Graph Visual Question Answering

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

Embodied Understanding of Driving Scenarios

DriveLLaVA: Human-Level Behavior Decisions via Vision Language Model