Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi
2024-05-09
Abstract: Vision-Language Models (VLMs) and Multi-Modal Language models (MMLMs) have become prominent in autonomous driving research, as these models can provide interpretable textual reasoning and responses for end-to-end autonomous driving safety tasks using traffic scene images and other data modalities. However, current approaches to these systems use expensive large language model (LLM) backbones and image encoders, making such systems unsuitable for real-time autonomous driving systems where tight memory constraints exist and fast inference time is necessary. To address these previous issues, we develop EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving. In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations, while also achieving higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset. EM-VLM4AD also exhibits the ability to extract relevant information from traffic views related to prompts and can answer questions for various autonomous driving subtasks. We release our code to train and evaluate our model at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The core problem this paper addresses is that existing vision-language models (VLMs) and multi-modal language models (MMLMs) for autonomous driving are too computationally expensive and memory-hungry for real-time use. Current methods typically rely on large language model (LLM) backbones and expensive image encoders, whose parameter counts and computational cost make them unsuitable for real-time inference in resource-constrained autonomous driving systems. In response, the authors develop EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model that performs visual question answering (VQA) for autonomous driving. Compared with existing methods, EM-VLM4AD has the following advantages:

1. **Significantly reduced memory and computational requirements**: EM-VLM4AD requires at least 10 times less memory and at least 10 times fewer floating-point operations than previous approaches.
2. **Higher performance**: Despite its smaller size, it achieves higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset.
3. **Multi-frame processing ability**: It can extract prompt-relevant information from multiple traffic views and answer questions for various autonomous driving subtasks.

Together, these improvements make EM-VLM4AD better suited to real autonomous driving systems, where memory is tight and fast inference is necessary, while it still outperforms the existing baseline on the reported metrics.
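To make the ROUGE-L comparison more concrete, here is a minimal sketch of how a ROUGE-L F-score can be computed from the longest common subsequence (LCS) of a generated answer and a reference answer. It is an illustration under stated assumptions, not the paper's evaluation code: the whitespace tokenization, the default `beta=1.0`, the function names, and the example sentences are all hypothetical choices made for this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between a candidate and a reference string.

    Assumptions for this sketch: lowercase whitespace tokenization and
    beta=1.0 (balanced precision and recall). Real evaluation toolkits
    may tokenize differently and use another beta.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical answer pair, not taken from DriveLM.
score = rouge_l("the car should slow down", "the car should stop")
print(round(score, 4))
```

In this example the LCS is "the car should" (3 tokens), so precision is 3/5 and recall is 3/4. A higher score means the generated answer shares a longer in-order token sequence with the reference, which is the sense in which one model "outperforms" another on this metric.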