Probing Multimodal LLMs as World Models for Driving

Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus
2024-10-26
Abstract: We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper evaluates how well Multimodal Large Language Models (MLLMs) can be applied to autonomous driving, in particular whether they can serve as world models. Through a series of experiments, it assesses their ability to understand the environment and make decisions in dynamic driving scenarios. The study finds that although these models perform well on single images, they show significant deficiencies in synthesizing coherent narratives across frames, understanding ego vehicle dynamics, reasoning about interactions with other road users, trajectory planning, and open-set scenario reasoning.

### Main research questions:

1. **Can MLLMs be used as world models for driving?**
   - The paper tests MLLMs in dynamic driving scenarios, focusing on their ability to understand complex, changing environments and to integrate sequences of visual data into the decision-making process.
2. **What are the specific challenges of MLLMs in dynamic driving scenarios?**
   - The research identifies difficulties in the following areas:
     - **Ego vehicle dynamics**: Identifying whether the vehicle is moving forward or backward, accelerating or decelerating, turning left or right.
     - **Interaction with other road users**: Detecting fast-moving vehicles and traffic jams.
     - **Trajectory planning**: Planning driving trajectories that avoid obstacles.
     - **Open-set scenario reasoning**: Handling unforeseen scenarios, such as suddenly appearing animals or airplanes landing.

### Experimental methods:

- **Dataset**: The paper introduces the Eval-LLM-Drive dataset, which contains real driving videos and diverse scenarios generated by the DriveSim simulator.
- **Evaluation metrics**: MLLM performance was evaluated along four dimensions: ego vehicle dynamics, interaction with other road users, trajectory planning, and open-set scenario reasoning.
- **Experimental design**: Different numbers of video frames (3, 6, and 9) were used to test the models' reasoning ability.

### Main findings:

- **Limitations of geometric and temporal reasoning**: MLLMs have significant difficulty using a sequence of frames to infer motion, especially when identifying vehicle dynamics such as moving forward or backward.
- **Bias problem**: Many models show obvious biases. For example, GPT-4V almost always predicts that the vehicle is moving forward, even in reverse-driving scenarios.
- **Challenges in open-set scenarios**: MLLMs perform poorly on unforeseen scenarios, such as suddenly appearing animals or airplanes landing.

### Conclusion:

The paper shows experimentally that although MLLMs perform well on single images, their performance in dynamic driving scenarios leaves much room for improvement. The study identifies current limitations of MLLMs in geometric and temporal reasoning and points to these as directions for future improvement.
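
The frame-count setup and the bias finding described above can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `sample_frame_indices` and `prediction_bias` are hypothetical, even spacing of frames is an assumption about how a clip might be subsampled, and the toy labels in the demo are invented for illustration rather than taken from the paper. The point it shows is that overall accuracy can hide a model that always answers "forward", so per-class recall and the share of each predicted answer are worth reporting alongside it.

```python
from collections import Counter


def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick num_frames indices spread evenly over a clip, endpoints included.

    Assumption: frames are subsampled uniformly; the paper summary only
    states that 3, 6, and 9 frames were used.
    """
    if num_frames < 2 or num_frames > total_frames:
        raise ValueError("need 2 <= num_frames <= total_frames")
    last = total_frames - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]


def prediction_bias(labels: list[str], predictions: list[str]) -> dict:
    """Summarize accuracy, per-class recall, and how often each answer is given."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    total = len(labels)
    pairs = list(zip(labels, predictions))
    class_counts = Counter(labels)
    class_hits = Counter(label for label, pred in pairs if label == pred)
    return {
        "accuracy": sum(label == pred for label, pred in pairs) / total,
        # Recall per true class: fraction of that class answered correctly.
        "recall": {c: class_hits[c] / n for c, n in class_counts.items()},
        # Share of each answer among all predictions, regardless of truth.
        "predicted_share": {c: n / total for c, n in Counter(predictions).items()},
    }


if __name__ == "__main__":
    # Frame indices for a hypothetical 90-frame clip at each tested setting.
    for n in (3, 6, 9):
        print(n, sample_frame_indices(90, n))

    # Invented toy data: a model that answers "forward" every time.
    truth = ["forward", "forward", "reverse", "reverse"]
    answers = ["forward"] * 4
    print(prediction_bias(truth, answers))
```

On the toy data, accuracy is 0.5 while recall on "reverse" is 0.0 and the predicted share of "forward" is 1.0, which is the signature of the kind of forward bias the summary attributes to GPT-4V.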