Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

What problem does this paper attempt to address?

The paper aims to improve closed-loop driving performance for autonomous vehicles that use only camera input. It introduces CarLLaVA, a vision-language model (VLM) designed for autonomous driving. By building on a pre-trained vision encoder and a large-language-model architecture, CarLLaVA targets state-of-the-art closed-loop driving from camera data alone, without complex or expensive labels.

### Main Problems and Solutions

1. **Reducing Dependence on Expensive Sensors**:
   - The paper points out that most existing high-performance autonomous driving systems rely on expensive LiDAR sensors. CarLLaVA relies entirely on camera input, which removes the need for these sensors and reduces the cost and complexity of the system.
2. **Improving Driving Performance**:
   - CarLLaVA combines a pre-trained vision encoder with a large language model architecture (LLaMA) to improve closed-loop driving. It ranked first in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
3. **Improving Output Representations**:
   - CarLLaVA adopts a semi-disentangled output representation that combines time-conditioned waypoints with space-conditioned path waypoints. The path gives better lateral control and the waypoints give better longitudinal control, which helps especially when turning and avoiding obstacles.
4. **Efficient Training Strategies**:
   - The paper proposes an efficient training recipe that reduces training time by building data buckets of interesting samples. This avoids spending compute on easy, trivial data and speeds up convergence.
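The bucket idea in the training strategy can be illustrated with a minimal sketch. The summary does not say how the buckets are defined, so the rules, thresholds, and field names below (`accel`, `steer`, `hazard`) are hypothetical, and the sampling fractions are arbitrary. The sketch only shows the general pattern: sort frames into buckets, then draw a small share of the easy frames and a large share of the interesting ones each epoch.

```python
import random
from collections import defaultdict

# Hypothetical bucket rules (assumptions, not taken from the paper): each rule
# returns True when a frame's metadata marks it as "interesting".
BUCKET_RULES = {
    "hard_brake": lambda f: f["accel"] < -2.0,
    "sharp_turn": lambda f: abs(f["steer"]) > 0.3,
    "hazard": lambda f: f["hazard"],
}

def build_buckets(frames):
    """Group frame indices by every rule they satisfy; the rest go to 'easy'."""
    buckets = defaultdict(list)
    for idx, frame in enumerate(frames):
        matched = [name for name, rule in BUCKET_RULES.items() if rule(frame)]
        for name in matched or ["easy"]:
            buckets[name].append(idx)
    return dict(buckets)

def sample_epoch(buckets, fractions, rng):
    """Draw a per-epoch subset: a fraction of each bucket, without replacement."""
    chosen = set()
    for name, indices in buckets.items():
        k = round(len(indices) * fractions.get(name, 0.0))
        chosen.update(rng.sample(indices, k))
    return sorted(chosen)
```

With fractions such as `{"hard_brake": 1.0, "sharp_turn": 1.0, "hazard": 1.0, "easy": 0.1}`, every interesting frame is seen each epoch while only a tenth of the easy frames are, which is one simple way to shrink an epoch without discarding rare events.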
### Specific Technical Details

- **Input Representations**:
  - The model inputs are camera images, the next target point, and the ego vehicle's speed. To handle high-resolution images, CarLLaVA splits the input image into several large patches, encodes each patch independently, and concatenates the results into one large feature map.
- **Visual Encoders**:
  - The vision encoder of LLaVA-NeXT is used, specifically the CLIP ViT-L-336px model, to capture important details in high-resolution images. Encoding multiple patches and concatenating their features helps the model recognize distant traffic lights and pedestrians.
- **Output Representations**:
  - A semi-disentangled output representation is adopted, combining time-conditioned waypoints with space-conditioned path waypoints. This provides more precise control across different driving scenarios.
- **Training Strategies**:
  - Data buckets of interesting samples are created to reduce training time. Model performance is further tuned by adjusting the early-stopping threshold.

### Conclusion

CarLLaVA achieves a significant improvement in closed-loop autonomous driving using only camera input and a pre-trained vision-language model. The approach reduces dependence on expensive sensors and demonstrates the potential of vision-language models for autonomous driving.
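The multi-patch encoding described under Input Representations can be sketched as follows. This is a toy illustration, not the authors' implementation: `toy_encoder` is a stand-in that mean-pools pixels and is not CLIP, and the function names are hypothetical. The tile size of 336 matches the encoder named above; the patch size of 14 is assumed from the public CLIP ViT-L/14 336px checkpoint.

```python
import numpy as np

TILE = 336   # input resolution of the CLIP ViT-L-336px encoder
PATCH = 14   # assumed ViT patch size (CLIP ViT-L/14)

def split_into_tiles(image):
    """Split an (H, W, C) image into a row-major grid of TILE x TILE tiles."""
    h, w, _ = image.shape
    assert h % TILE == 0 and w % TILE == 0, "image must be a multiple of TILE"
    return [[image[r:r + TILE, c:c + TILE] for c in range(0, w, TILE)]
            for r in range(0, h, TILE)]

def toy_encoder(tile):
    """Stand-in for the vision encoder: mean-pool each PATCH x PATCH block.

    Returns a (TILE // PATCH, TILE // PATCH, C) feature grid.
    """
    g = TILE // PATCH
    c = tile.shape[2]
    return tile.reshape(g, PATCH, g, PATCH, c).mean(axis=(1, 3))

def encode_tiled(image):
    """Encode each tile independently, then stitch the grids back together."""
    rows = [np.concatenate([toy_encoder(t) for t in row], axis=1)
            for row in split_into_tiles(image)]
    return np.concatenate(rows, axis=0)
```

For a 336 x 672 image this yields two tiles and a stitched 24 x 48 feature grid. A single 336 x 336 pass over a downscaled copy would yield only 24 x 24, which is why tiling preserves more detail for small, distant objects.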

CarLLaVA: Vision language models for camera-only closed-loop driving

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

VLP: Vision Language Planning for Autonomous Driving

DriveLLaVA: Human-Level Behavior Decisions via Vision Language Model

Conditional Vehicle Trajectories Prediction in CARLA Urban Environment

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Vision Language Models in Autonomous Driving: A Survey and Outlook

LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation

LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving