Abstract: We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

What problem does this paper attempt to address?

The paper evaluates how well Multimodal Large Language Models (MLLMs) can be applied to autonomous driving, and in particular how well they serve as world models. Through a series of experiments, it assesses the models' ability to understand the environment and make decisions in dynamic driving scenarios. The study finds that although these models perform well on single images, they show significant deficiencies in synthesizing coherent narratives across frames, understanding ego vehicle dynamics, reasoning about interactions with other road users, trajectory planning, and open-set scenario reasoning.

### Main research questions:

1. **Can MLLMs be used as world models for driving?**
   - The paper explores the performance of MLLMs in dynamic driving scenarios through experiments, focusing on their ability to understand complex, dynamic environments and to integrate sequences of visual data into the decision-making process.
2. **What are the specific challenges of MLLMs in dynamic driving scenarios?**
   - The research reveals difficulties in the following areas:
     - **Ego vehicle dynamics**: Identifying whether the vehicle is moving forward or backward, accelerating or decelerating, turning left or right.
     - **Interaction with other road users**: Detecting fast-moving vehicles and traffic jams.
     - **Trajectory planning**: Planning driving trajectories that avoid obstacles.
     - **Open-set scenario reasoning**: Handling unforeseen scenarios, such as animals appearing suddenly or airplanes landing.

### Experimental methods:

- **Dataset**: The paper introduces the Eval-LLM-Drive dataset, which contains real driving videos and diverse scenarios generated by the DriveSim simulator.
- **Evaluation metrics**: MLLM performance was evaluated along multiple dimensions, including ego vehicle dynamics, interaction with other road users, trajectory planning, and open-set scenario reasoning.
- **Experimental design**: Different numbers of video frames (3, 6, and 9) were used to test the models' reasoning ability.

### Main findings:

- **Limitations of geometric and temporal reasoning**: MLLMs have significant difficulty inferring motion from sequences of frames, especially when identifying vehicle dynamics (such as moving forward or backward).
- **Bias problem**: Many models show obvious biases. For example, GPT-4V almost always predicts that the vehicle is moving forward, even in reverse-driving scenarios.
- **Challenges in open-set scenarios**: MLLMs perform poorly when handling unforeseen scenarios (such as animals appearing suddenly or airplanes landing).

### Conclusion:

The paper shows experimentally that although MLLMs perform well at understanding single images, their performance in dynamic driving scenarios leaves much room for improvement. The study identifies current limitations of MLLMs in geometric and temporal reasoning and points to directions for future improvement.
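The frame-count setup (3, 6, or 9 frames) and the forward-motion bias described above can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. The function names, the evenly spaced sampling strategy, and the label strings `"forward"` and `"backward"` are all assumptions made for illustration. The sketch picks evenly spaced frame indices from a clip, then compares per-class accuracy with the share of each predicted label, which is one simple way to expose a model that answers "forward" regardless of the input.

```python
from collections import Counter


def sample_frame_indices(total_frames, k):
    """Return k evenly spaced frame indices (hypothetical sampling scheme)."""
    if total_frames <= 0 or k <= 0:
        raise ValueError("total_frames and k must be positive")
    if k >= total_frames:
        return list(range(total_frames))
    if k == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids rounding surprises.
    return [i * (total_frames - 1) // (k - 1) for i in range(k)]


def per_class_accuracy(labels, preds):
    """Accuracy computed separately for each ground-truth class."""
    totals = Counter(labels)
    correct = Counter(l for l, p in zip(labels, preds) if l == p)
    return {c: correct[c] / totals[c] for c in totals}


def prediction_share(preds):
    """Fraction of all predictions assigned to each label."""
    counts = Counter(preds)
    return {c: counts[c] / len(preds) for c in counts}


if __name__ == "__main__":
    # Frame indices for the three frame budgets, assuming a 90-frame clip.
    for k in (3, 6, 9):
        print(k, sample_frame_indices(90, k))

    # Toy labels (made up): a model that always answers "forward".
    labels = ["forward", "forward", "backward", "backward"]
    preds = ["forward"] * 4
    print(per_class_accuracy(labels, preds))
    print(prediction_share(preds))
```

Reporting accuracy per class rather than overall matters here: on a test set dominated by forward-driving clips, a model that always predicts "forward" would score well overall while failing every reverse-driving clip.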

Probing Multimodal LLMs as World Models for Driving
