Abstract: This paper investigates the integration of Large Vision Language Models (LVLMs) with multi-sensor information, including visual and localization data from cameras and LiDAR, to build a holistic understanding of traffic videos. Traffic scene understanding is a challenging problem. Because of the complex interactions between road actors, infrastructure, and traffic rules, it is often difficult to answer questions related to road safety, pedestrian safety, safe maneuvering characteristics, and human factors. Typical pipelines use several single-task neural network models and combine their outputs through semantic and symbolic reasoning. These pipelines often suffer from reasoning bias and incompleteness. In recent years, LVLMs have opened new avenues for perceiving spatiotemporal information. These models can leverage broad world knowledge and summarize spatiotemporal information effectively. The interactive nature of most of these systems allows humans to query them directly in a visual question-answering (VQA) mode. In this paper, we extensively test the capabilities of such LVLMs to answer key transportation research questions from videos captured by front-facing cameras. We have curated an extensive set of multiple-choice questions (MCQs) to evaluate the performance of these LVLMs. Our results show that LVLMs can understand various transportation-related aspects to a large extent. Furthermore, we show that adding supplementary modalities to the VQA setting improves the performance of LVLMs. When the 3D trajectories of surrounding objects are added to the 2D video frames, we observe a significant increase in MCQ performance on vehicle-to-vehicle interaction tasks. The resources for this paper can be found at https://github.com/sandeshrjain/lvlm-scene
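
The evaluation setup described in the abstract, multiple-choice questions optionally augmented with 3D trajectories of surrounding objects, might be sketched roughly as follows. This is an illustrative sketch only and not the authors' implementation: the `MCQ` class, the `build_prompt` and `accuracy` helpers, the plain-text serialization of trajectories, and the choice of metres as units are all assumptions introduced here. The call to an actual LVLM is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class MCQ:
    """One multiple-choice question about a traffic video (hypothetical schema)."""
    question: str
    options: list[str]
    answer: str  # correct option letter, e.g. "B"
    # Assumed format: object id -> list of (x, y, z) positions in metres.
    trajectories: dict[str, list[tuple[float, float, float]]] = field(default_factory=dict)


def build_prompt(mcq: MCQ) -> str:
    """Serialize the question, and any 3D trajectories, as text for a VQA prompt."""
    lines = []
    if mcq.trajectories:
        lines.append("3D trajectories of surrounding objects (x, y, z in metres):")
        for obj_id, points in mcq.trajectories.items():
            path = " -> ".join(f"({x:.1f}, {y:.1f}, {z:.1f})" for x, y, z in points)
            lines.append(f"  {obj_id}: {path}")
    lines.append(f"Question: {mcq.question}")
    for letter, option in zip("ABCDEFGH", mcq.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(predictions: list[str], questions: list[MCQ]) -> float:
    """Fraction of predictions whose first letter matches the correct option."""
    if not questions:
        return 0.0
    correct = sum(
        p.strip().upper()[:1] == q.answer for p, q in zip(predictions, questions)
    )
    return correct / len(questions)
```

Comparing `accuracy` on prompts built with and without the `trajectories` field would be one simple way to measure the effect of the extra modality on MCQ performance.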
