Abstract: This paper investigates the integration of Large Vision Language Models (LVLMs) with multi-sensor information, including visual and localization data from cameras and LiDAR data, to achieve a holistic understanding of traffic videos. Traffic scene understanding is a challenging problem. Because of the complex interactions among road actors, infrastructure, and traffic rules, it is often difficult to answer questions related to road safety, pedestrian safety, safe maneuvering characteristics, and human factors. Typical approaches use several single-task neural network models and combine their outputs through semantic and symbolic reasoning. These approaches often suffer from reasoning bias and incompleteness. In recent years, LVLMs have opened new avenues for perceiving spatiotemporal information. These models can leverage broad world knowledge and summarize spatiotemporal information effectively. The interactive nature of most of these systems allows humans to query them directly in a visual question-answering (VQA) mode. In this paper, we extensively test the capabilities of such LVLMs to answer key transportation research questions from videos captured by front-facing cameras. We curated an extensive set of multiple-choice questions (MCQs) to evaluate the performance of these LVLMs. Our results show that LVLMs can understand various transportation-related aspects to a great extent. Furthermore, we show that adding supplementary modalities to the VQA setting helps improve the performance of LVLMs. When the 3D trajectories of surrounding objects were added to the 2D video frames, we observed a significant increase in MCQ performance on vehicle-to-vehicle interaction tasks. The resources for this paper can be found at https://github.com/sandeshrjain/lvlm-scene

Semantic Understanding of Traffic Scenes with Large Vision Language Models
