MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Can Cui, Tongxi Zhou, Xu Cao, Ziran Wang, Wenqian Ye, Yunsheng Ma, Kun Tang, J. Rehg, Chao Zheng, Kaizhao Liang, Zhipeng Cao
DOI: https://doi.org/10.1109/CVPR52733.2024.02061
2024-06-16
Computer Vision and Pattern Recognition
Abstract: Vision-language generative AI has demonstrated remarkable promise for empowering cross-modal scene understanding of autonomous driving and high-definition (HD) map systems. However, current benchmark datasets lack multi-modal point cloud, image, and language data pairs. Recent approaches utilize visual instruction learning and cross-modal prompt engineering to expand vision-language models into this domain. In this paper, we propose a new vision-language benchmark that can be used to finetune traffic and HD map domain-specific foundation models. Specifically, we annotate and leverage large-scale, broad-coverage traffic and map data extracted from huge HD map annotations, and use CLIP and LLaMA-2 / Vicuna to finetune a baseline model with instruction-following data. Our experimental results across various algorithms reveal that while visual instruction-tuning large language models (LLMs) can effectively learn meaningful representations from MAPLM-QA, there remains significant room for further advancements. To facilitate applying LLMs and multi-modal data into self-driving research, we will release our visual-language QA data, and the baseline models at GitHub.com/LLVM-AD/MAPLM.
Engineering, Computer Science