Semantic Understanding of Traffic Scenes with Large Vision Language Models

Surendrabikram Thapa, A. L. Abbott, Kuan-Ting Chen, Abhijit Sarkar, Sandesh Jain
DOI: https://doi.org/10.1109/IV55156.2024.10588373
2024-06-02
Abstract: This paper investigates the integration of Large Vision Language Models (LVLMs) with multi-sensor information, including visual and localization data from cameras and LiDAR, to achieve a holistic understanding of traffic videos. Traffic scene understanding is a challenging problem. Because of the complex interactions among road actors, infrastructure, and traffic rules, it is often difficult to answer questions related to road safety, pedestrian safety, safe maneuvering characteristics, and human factors. Typical processes use single task-oriented neural network models and combine them through semantic and symbolic reasoning. These processes often suffer from reasoning bias and incompleteness. In recent years, LVLMs have opened new avenues to perceive spatiotemporal information. These models can leverage a large base of world knowledge and summarize spatiotemporal information effectively. The interactive nature of most of these systems allows humans to interact with them directly in a visual question-answering (VQA) mode. In this paper, we extensively test the capabilities of such LVLMs to answer key transportation research questions from videos captured through front cameras. We curated an extensive set of multiple-choice questions (MCQs) to evaluate the performance of these LVLMs. Our results show that LVLMs can understand various transportation-related aspects to a great extent. Furthermore, we show that adding supplementary modalities to the VQA setting helps improve the performance of LVLMs. With the addition of 3D trajectories of surrounding objects to the 2D video frames, we observed a significant increase in MCQ performance on vehicle-to-vehicle interaction tasks. The resources for this paper can be found at https://github.com/sandeshrjain/lvlm-scene
Engineering, Environmental Science, Computer Science