Empowering Corner Case Detection in Autonomous Vehicles with Multimodal Large Language Models

Tianqi Liu, Yanjun Qin, Shanghang Zhang, Xiaoming Tao
DOI: https://doi.org/10.1109/lsp.2024.3495557
2024-01-01
IEEE Signal Processing Letters
Abstract: Object detection powered by deep learning is an essential component of self-driving vehicles. However, detection models can be affected by corner cases, that is, rare or unusual objects and scenarios, which can significantly degrade the reliability of object detection systems. In this paper, we apply a Multimodal Large Language Model (MLLM) to address the challenge of corner cases in autonomous driving systems. The MLLM consists of an image encoder, a text tokenizer, a modal alignment layer, and a pre-trained large language model, enabling it to understand multimodal semantic information. We add text descriptions to the corner case dataset CODA to construct the CODA-REC dataset, which is then used for instruction fine-tuning of the MLLM to adapt it to the object detection task. The proposed method leverages the extensive knowledge and zero-shot learning capabilities of LLMs to enhance the semantic understanding of text and images, enabling the detection of, and appropriate response to, corner cases that were previously difficult to handle. Experimental results show that the MLLM outperforms the baseline models, with an improvement of about 10% in the mAR and mAP metrics over most closed-set models and an improvement of 10% in mAP over open-set models. We hope that our work can inspire the application of MLLMs in the field of autonomous driving, contributing to more advanced intelligent transportation systems.