CarLLaVA: Vision language models for camera-only closed-loop driving

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski
2024-06-15
Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The main problem this paper addresses is closed-loop driving performance for autonomous vehicles that use camera input only. It introduces CarLLaVA, a vision-language model (VLM) designed for autonomous driving. By building on a pre-trained vision encoder and a large-language-model architecture, CarLLaVA aims for state-of-the-art closed-loop driving from camera data alone, without complex or expensive labels.

### Main Problems and Solutions

1. **Reducing Dependence on Expensive Sensors**:
   - The paper points out that most existing high-performance autonomous driving systems rely on expensive LiDAR sensors. CarLLaVA relies entirely on camera input, which removes the need for those sensors and reduces the cost and complexity of the system.
2. **Improving Driving Performance**:
   - CarLLaVA improves closed-loop driving performance by using a pre-trained vision encoder and a large-language-model backbone (the LLaMA architecture). It ranked first in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
3. **Improving Output Representations**:
   - CarLLaVA adopts a semi-disentangled output representation that combines time-conditioned waypoints with space-conditioned path waypoints. The path gives better lateral control and the waypoints give better longitudinal control, which helps especially when turning and avoiding obstacles.
4. **Efficient Training Strategies**:
   - The paper proposes an efficient training recipe that shortens training time by building data buckets of interesting samples. This avoids spending compute on easy, trivial data and speeds up convergence.
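The bucketing idea in point 4 can be illustrated with a small sketch. This is not the authors' recipe: the bucket names, the thresholds in `assign_bucket`, and the sample fields (`steer`, `accel`, `hazard`) are hypothetical and chosen only to show the mechanism. Samples that match no bucket are treated as trivial and never drawn, and each epoch takes a fixed number of samples per bucket instead of iterating over the full dataset.

```python
import random
from collections import defaultdict


def assign_bucket(sample):
    """Hypothetical rule: name the bucket for a sample, or None if it is trivial."""
    if abs(sample["steer"]) > 0.3:   # assumed threshold for sharp steering
        return "sharp_steering"
    if abs(sample["accel"]) > 1.0:   # assumed threshold for strong speed changes
        return "speed_change"
    if sample["hazard"]:             # assumed flag for a nearby hazard
        return "hazard"
    return None                      # easy sample, e.g. driving straight


def build_buckets(samples):
    """Group the interesting samples by bucket name and drop trivial ones."""
    buckets = defaultdict(list)
    for sample in samples:
        name = assign_bucket(sample)
        if name is not None:
            buckets[name].append(sample)
    return dict(buckets)


def sample_epoch(buckets, per_bucket, rng):
    """Draw up to `per_bucket` samples from every bucket, then shuffle them."""
    epoch = []
    for name in sorted(buckets):     # sorted for a reproducible draw order
        items = buckets[name]
        epoch.extend(rng.sample(items, min(per_bucket, len(items))))
    rng.shuffle(epoch)
    return epoch
```

With a dataset dominated by straight, constant-speed frames, an epoch built this way is much smaller than the raw dataset and contains only frames where the driving decision is non-trivial.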
### Specific Technical Details

- **Input Representations**:
  - The model inputs are camera images, the next target point, and the ego vehicle's speed. To handle high-resolution images, CarLLaVA splits the input image into several large patches, encodes each patch independently, and then concatenates the results into one large feature map.
- **Visual Encoders**:
  - The vision encoder of LLaVA-NeXT is used, specifically the CLIP ViT-L-336px model, to capture important details in high-resolution images. Encoding the patches separately and concatenating their features helps the model recognize distant traffic lights and pedestrians.
- **Output Representations**:
  - A semi-disentangled output representation is adopted, combining time-conditioned waypoints with space-conditioned path waypoints. This gives more precise control across different driving scenarios.
- **Training Strategies**:
  - Data buckets containing interesting samples are created to reduce training time. Model performance is further tuned by adjusting the early-stopping threshold.

### Conclusion

CarLLaVA achieves a large improvement in closed-loop driving performance using only camera input and a pre-trained vision-language model. The approach reduces the dependence on expensive sensors and shows the potential of vision-language models for autonomous driving.
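The input encoding described under Input Representations (split the image, encode each piece independently, concatenate the features) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `encode_tile` is a stand-in that average-pools pixel blocks and is not the CLIP ViT encoder, the image is a nested list of grayscale values, and the function names and tile sizes are hypothetical.

```python
def split_into_tiles(image, tile_w):
    """Split an H x W image (list of rows) into side-by-side H x tile_w tiles."""
    width = len(image[0])
    if width % tile_w != 0:
        raise ValueError("image width must be a multiple of tile_w")
    return [[row[x:x + tile_w] for row in image] for x in range(0, width, tile_w)]


def encode_tile(tile, cell=2):
    """Stand-in encoder: average each cell x cell pixel block into one feature.

    Assumes the tile height and width are multiples of `cell`.
    """
    height, width = len(tile), len(tile[0])
    features = []
    for y in range(0, height, cell):
        row = []
        for x in range(0, width, cell):
            block = [tile[yy][xx]
                     for yy in range(y, y + cell)
                     for xx in range(x, x + cell)]
            row.append(sum(block) / len(block))
        features.append(row)
    return features


def encode_image(image, tile_w):
    """Encode every tile independently, then concatenate the feature maps.

    Tiles are joined left to right so the combined feature map keeps the
    spatial layout of the original image.
    """
    encoded = [encode_tile(tile) for tile in split_into_tiles(image, tile_w)]
    rows = len(encoded[0])
    return [[value for feats in encoded for value in feats[r]] for r in range(rows)]
```

Because each tile is passed to the encoder at its native resolution, small or distant objects cover more encoder input pixels than they would if the whole image were downscaled to a single encoder-sized input.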