RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen
2024-12-15
Abstract: Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text increased by 5.22%. The ROUGE-L F1-score increased by 39.91%, the Precision by 55.80%, and the Recall by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The key problem this paper addresses is how to improve the safety and reliability of autonomous driving systems when they encounter extreme situations (corner cases). Specifically, the paper focuses on the challenges that Vision-Language Models (VLMs) face in understanding autonomous driving scenarios, in particular hallucination and insufficient real-world grounding. These problems degrade the performance of VLMs in critical driving scenarios and thereby affect the overall performance of the system.

### Main contributions of the paper

To address these problems, the authors propose a new framework named RAC3, which aims to enhance the ability of VLMs to handle corner cases in the following ways:

1. **Introducing Retrieval-Augmented Generation (RAG)**: Hallucination is mitigated by dynamically incorporating context-specific external knowledge.
2. **Cross-modal alignment fine-tuning algorithm**: Contrastive learning embeds image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios.
3. **Combining the new corner-case image with retrieved images**: The new image is fed to the VLM together with similar images retrieved from the database, and the prompts are designed so that the model can reason over both images at once.
4. **Experimental verification**: Extensive experiments evaluate the RAC3 framework and demonstrate its advantages in improving semantic alignment, reducing hallucination, and raising performance metrics such as cosine similarity and ROUGE-L scores.

### Core technological innovations

- **Cross-modal alignment fine-tuning**: Contrastive learning and negative sample mining are used to keep image and text embeddings closely aligned in the shared semantic space.
- **Retrieval-Augmented Generation**: External knowledge introduced through the retrieval mechanism keeps the model output closer to the actual scene, reducing hallucination.
- **Efficient deployment**: The proposed framework significantly improves the performance of small VLMs without fine-tuning them, reducing the demand for computational resources and facilitating in-vehicle applications.

### Experimental results

The experimental results show that the RAC3 framework can significantly improve the understanding and reasoning abilities of VLMs in corner cases. For example, on the LLaVA-v1.6-34B model, the cosine similarity between the generated text and the reference text increases by 5.22%, the ROUGE-L F1-score increases by 39.91%, the precision increases by 55.80%, and the recall increases by 13.74%.

### Conclusion

The RAC3 framework shows the potential of retrieval-augmented VLMs to improve the robustness and safety of autonomous driving systems, especially in complex environments. By effectively integrating external knowledge, RAC3 not only improves the accuracy of the model but also enhances its interpretability and reliability, supporting progress toward high-level autonomous driving.
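
As a rough illustration of the cross-modal alignment and retrieval steps summarized above, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) loss over matched image-text pairs and a cosine-similarity lookup over a small scenario database. This is not the authors' implementation: the summary does not specify the loss or the negative sample mining strategy, so plain in-batch negatives are assumed here. The function names, the temperature value, the scenario labels, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched image-text pairs.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    item in the batch acts as a negative (assumption: in-batch negatives,
    not the paper's negative sample mining).
    """
    n = len(image_embs)
    sims = [[cosine(img, txt) / temperature for txt in text_embs]
            for img in image_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denominator - row[i]
        return total / n

    image_to_text = cross_entropy(sims)
    text_to_image = cross_entropy([list(col) for col in zip(*sims)])
    return 0.5 * (image_to_text + text_to_image)


def retrieve_top_k(query_emb, database, k=1):
    """Return the k (similarity, name) pairs closest to the query embedding."""
    scored = [(cosine(query_emb, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical scenario database: name -> embedding in the shared space.
database = {
    "fallen_tree_on_road": [0.9, 0.1, 0.0],
    "animal_crossing": [0.1, 0.9, 0.1],
    "construction_zone": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of a new corner-case image (made up)
best_score, best_name = retrieve_top_k(query, database, k=1)[0]
print(best_name)  # prints "fallen_tree_on_road"
```

In a retrieval-augmented setup like the one described, the retrieved scenario and its reference description would then be placed in the VLM prompt alongside the new image. A real system would use learned embeddings and an approximate nearest-neighbor index rather than this linear scan.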