Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi

2024-05-09

Abstract: Vision-Language Models (VLMs) and Multi-Modal Language models (MMLMs) have become prominent in autonomous driving research, as these models can provide interpretable textual reasoning and responses for end-to-end autonomous driving safety tasks using traffic scene images and other data modalities. However, current approaches to these systems use expensive large language model (LLM) backbones and image encoders, making such systems unsuitable for real-time autonomous driving systems where tight memory constraints exist and fast inference time is necessary. To address these previous issues, we develop EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving. In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations, while also achieving higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset. EM-VLM4AD also exhibits the ability to extract relevant information from traffic views related to prompts and can answer questions for various autonomous driving subtasks. We release our code to train and evaluate our model at
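To make the memory claim concrete, here is a back-of-the-envelope sketch of how weight memory scales with parameter count and numeric precision. The function name `weight_memory_gb` and both parameter counts are hypothetical placeholders chosen for illustration; they are not figures reported for EM-VLM4AD or any baseline, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate memory needed to store model weights, in gigabytes.

    bytes_per_param: 4 for float32, 2 for float16/bfloat16, 1 for int8.
    """
    return num_params * bytes_per_param / 1e9


# Hypothetical sizes, NOT taken from the paper.
small_model_gb = weight_memory_gb(250_000_000, bytes_per_param=4)    # float32
large_model_gb = weight_memory_gb(7_000_000_000, bytes_per_param=2)  # float16

# How many times more weight memory the larger model needs.
memory_ratio = large_model_gb / small_model_gb

print(f"small: {small_model_gb:.1f} GB, large: {large_model_gb:.1f} GB, "
      f"ratio: {memory_ratio:.1f}x")
```

Under these assumed sizes the larger model needs over ten times the weight memory even at half precision, which is the kind of gap that matters on memory-constrained in-vehicle hardware.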

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The core problem this paper addresses is that existing vision-language models (VLMs) and multi-modal language models (MMLMs) for autonomous driving are too computationally expensive and memory-intensive for real-time use. Current methods typically rely on large language model (LLM) backbones and expensive image encoders, which makes them unsuitable for real-time inference in resource-constrained autonomous driving systems. To address this, the authors develop EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model that performs visual question answering (VQA) for autonomous driving. Compared with previous approaches, EM-VLM4AD offers the following advantages:

1. **Lower memory and compute requirements**: EM-VLM4AD requires at least 10 times less memory and floating-point operations than previous approaches.
2. **Higher performance**: Despite its smaller size, it achieves higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset.
3. **Multi-frame processing**: It extracts prompt-relevant information from multiple traffic views and answers questions for various autonomous driving subtasks.

Together, these properties make EM-VLM4AD better suited to real-time autonomous driving systems, where memory is tightly constrained and fast inference is necessary.
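For readers unfamiliar with ROUGE-L, one of the metrics mentioned above, the following is a minimal sketch of the sentence-level score based on the longest common subsequence (LCS) between a generated answer and a reference answer. It assumes lowercase whitespace tokenization and a default `beta` of 1.0 (a plain F1); the tokenization and weighting used in the actual DriveLM evaluation may differ, and the example sentences are invented for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-score using whitespace tokens (a simplification)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Invented example: the LCS is "the car ... left" (3 tokens).
score = rouge_l("the car is turning left", "the car turns left")
print(f"ROUGE-L: {score:.3f}")
```

Because the LCS rewards words that appear in the same order without requiring them to be adjacent, ROUGE-L tolerates small rephrasings in a generated answer while still penalizing missing or reordered content.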

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

DriveLM: Driving with Graph Visual Question Answering

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Semantic Understanding of Traffic Scenes with Large Vision Language Models

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving

Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Vision Language Models in Autonomous Driving: A Survey and Outlook

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

EMMA: End-to-End Multimodal Model for Autonomous Driving

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles
