Abstract: Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
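The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that image and text embeddings already live in a shared space and that similarity is measured by cosine similarity; the tuple layout, the `retrieve_top_k` helper, the scenario IDs, and the toy 3-dimensional vectors are all hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, database, k=1):
    """Return the k entries most similar to the query, best first.

    `database` is a list of (scenario_id, embedding, description)
    tuples; this layout is an assumption made for the sketch.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), scenario_id, description)
        for scenario_id, embedding, description in database
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy embeddings, made up for illustration only.
database = [
    ("case_a", [0.9, 0.1, 0.0], "Pedestrian crossing outside a crosswalk."),
    ("case_b", [0.0, 1.0, 0.2], "Overturned truck blocking two lanes."),
    ("case_c", [0.1, 0.2, 0.9], "Animal standing on a rural road."),
]
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a new corner-case image

best = retrieve_top_k(query, database, k=1)
print(best[0][1], best[0][2])
```

In a real system the description of the retrieved scenario (and possibly its image) would be passed to the VLM as additional context; a linear scan like this one would typically be replaced by an approximate nearest-neighbor index once the database grows.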

What problem does this paper attempt to address?

The key problem this paper addresses is how to improve the safety and reliability of autonomous driving systems when they encounter corner cases (rare, extreme situations). Specifically, the paper focuses on the challenges that Vision-Language Models (VLMs) face in understanding autonomous driving scenarios, especially hallucination and the lack of real-world context. These problems degrade VLM performance in critical driving scenarios and, in turn, the performance of the overall system.

### Main contributions of the paper

To address these problems, the authors propose a new framework named RAC3, which aims to enhance the ability of VLMs to handle corner cases in the following ways:

1. **Introducing Retrieval-Augmented Generation (RAG)**: Context-specific external knowledge is integrated dynamically to alleviate hallucination.
2. **Cross-modal alignment fine-tuning algorithm**: Contrastive learning embeds image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios.
3. **Combining new corner-case images with retrieved images**: The new image and similar images retrieved from the database are input into the VLM together, with prompt engineering methods designed to improve the model's ability to understand both images at once.
4. **Experimental verification**: Extensive experiments evaluate the effectiveness of the RAC3 framework and demonstrate its advantages in improving semantic alignment, reducing hallucination, and improving performance metrics such as cosine similarity and ROUGE-L scores.

### Core technological innovations

- **Cross-modal alignment fine-tuning**: Contrastive learning and negative sample mining are used to ensure close alignment of image and text embeddings in the shared semantic space.
- **Retrieval-Augmented Generation**: External knowledge introduced through the retrieval mechanism keeps the model output closer to the actual scene, reducing hallucination.
- **Efficient deployment**: The proposed framework significantly improves the performance of small VLMs without fine-tuning, reducing the demand for computational resources and facilitating in-vehicle applications.

### Experimental results

The experimental results show that the RAC3 framework can significantly improve the understanding and reasoning abilities of VLMs on corner cases. For example, on the LLaVA-v1.6-34B model, the cosine similarity between the generated text and the reference text increases by 5.22%, the ROUGE-L F1 score increases by 39.91%, the precision increases by 55.80%, and the recall increases by 13.74%.

### Conclusion

The RAC3 framework shows the potential of retrieval-augmented VLMs to improve the robustness and safety of autonomous driving systems, especially in complex environments. By effectively integrating external knowledge, RAC3 not only improves the accuracy of the model but also enhances its interpretability and reliability, providing strong support for achieving high-level autonomous driving.
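For readers unfamiliar with the ROUGE-L precision, recall, and F1 figures reported above, the following is a minimal sketch of how such scores are commonly computed from the longest common subsequence (LCS) of tokens. It assumes lowercase whitespace tokenization and an equally weighted F1; the evaluation code used in the paper may tokenize, stem, or weight differently, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 for two plain-text strings.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)  # share of generated tokens that match
    recall = lcs / len(ref)      # share of reference tokens that are covered
    if lcs == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example sentences, not taken from the paper's dataset.
reference = "a pedestrian is crossing the road ahead"
candidate = "a pedestrian is on the road"

scores = rouge_l(candidate, reference)
print(scores)
```

Here the LCS is "a pedestrian is the road" (5 tokens), so precision is 5/6 and recall is 5/7. Precision penalizes extra generated tokens that do not appear in the reference, which is one way hallucinated content can show up in this metric.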

RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Empowering Corner Case Detection in Autonomous Vehicles with Multimodal Large Language Models

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model

DriveLM: Driving with Graph Visual Question Answering

VLP: Vision Language Planning for Autonomous Driving

LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

Hallucination Reduction and Optimization for Large Language Model-Based Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

Enhancing Autonomous Vehicle Training with Language Model Integration and Critical Scenario Generation

Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?

Receive, Reason, and React: Drive as You Say, With Large Language Models in Autonomous Vehicles