Abstract: The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering (VQA) dataset that challenges the true 3D situational awareness of a model with comprehensive VQA tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.

What problem does this paper attempt to address?

The paper proposes a solution to two major challenges in autonomous driving:

1. **3D Spatial Understanding and Planning**: Existing multimodal large language models (MLLMs) reason well over 2D inputs, but planning requires full 3D situational awareness. To close this gap, the paper introduces the OmniDrive framework, which comprises a new model architecture, OmniDrive-Agent, and an evaluation benchmark, OmniDrive-nuScenes.
   - **OmniDrive-Agent**: A new 3D multimodal LLM based on the Q-Former architecture. It uses sparse queries to lift and compress visual representations into 3D before feeding them into the LLM. This design lets the model jointly encode dynamic objects and static map elements (such as traffic lanes), providing a condensed world model for perception-action alignment.
   - **OmniDrive-nuScenes**: A comprehensive visual question answering (VQA) benchmark covering scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Notably, its counterfactual reasoning setting simulates decisions and trajectories to infer their potential consequences, which aids the model's understanding and planning in complex 3D scenarios.
2. **Processing Multi-view High-resolution Video Inputs**: Traditional 2D MLLM architectures (like LLaVA-1.5) are limited by image encoder resolution and LLM token sequence length, which makes them difficult to apply directly to autonomous driving scenarios. OmniDrive-Agent addresses this by using the Q-Former architecture to compress visual information into sparse queries, so that multi-view high-resolution video can be processed while maintaining computational efficiency.
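The query-based compression described above can be illustrated with a toy cross-attention step. This is a minimal sketch and not the paper's implementation: it uses a single attention head, omits learned projections and any 3D position information, and every size (`NUM_VIEWS`, `TOKENS_PER_VIEW`, `DIM`, `NUM_QUERIES`) and function name is a hypothetical value chosen for illustration. It only shows why the number of tokens passed to the LLM depends on the number of queries rather than on the number of camera views or image patches.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_with_queries(image_tokens, queries):
    """Toy single-head cross-attention (assumption: no learned projections).

    Each query attends over all image tokens and returns one output vector,
    a softmax-weighted average of the image tokens.
    """
    dim = len(queries[0])
    compressed = []
    for query in queries:
        scores = [dot(query, token) / math.sqrt(dim) for token in image_tokens]
        weights = softmax(scores)
        compressed.append(
            [sum(w * token[i] for w, token in zip(weights, image_tokens))
             for i in range(dim)]
        )
    return compressed


random.seed(0)
# Hypothetical toy sizes, not values taken from the paper.
NUM_VIEWS, TOKENS_PER_VIEW, DIM, NUM_QUERIES = 6, 64, 8, 16

# Flatten the tokens of all camera views into one sequence.
image_tokens = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
                for _ in range(NUM_VIEWS * TOKENS_PER_VIEW)]
# In a trained model the queries would be learned; here they are random.
queries = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
           for _ in range(NUM_QUERIES)]

compressed = compress_with_queries(image_tokens, queries)
print(f"{len(image_tokens)} image tokens -> {len(compressed)} query tokens")
# prints "384 image tokens -> 16 query tokens"
```

Adding more views or raising the image resolution increases the length of `image_tokens` but leaves the output length at `NUM_QUERIES`, which is the property that keeps the LLM's input sequence bounded.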
In summary, OmniDrive aims to provide a holistic framework for end-to-end autonomous driving that combines strong 3D reasoning and planning capabilities with a more challenging evaluation benchmark that goes beyond single expert trajectories.
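The idea of counterfactual reasoning over simulated trajectories, as opposed to scoring a single expert trajectory, can be sketched with a simple geometric check. This is an illustrative assumption about what such a check might look like, not the procedure used to build OmniDrive-nuScenes: the `Obstacle` type, the `first_conflict` helper, the circular footprints, and all coordinates are hypothetical, and obstacles are treated as static for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Obstacle:
    """Hypothetical static obstacle with a circular footprint (meters)."""
    x: float
    y: float
    radius: float


def first_conflict(trajectory, obstacles, ego_radius=1.0):
    """Return (waypoint_index, obstacle_index) of the first overlap, or None.

    The ego vehicle is approximated by a circle of `ego_radius` centered on
    each (x, y) waypoint; a conflict is any overlap with an obstacle circle.
    """
    for i, (x, y) in enumerate(trajectory):
        for j, ob in enumerate(obstacles):
            if math.hypot(x - ob.x, y - ob.y) < ego_radius + ob.radius:
                return i, j
    return None


# A stopped vehicle 12 m ahead in the ego lane (x is lateral, y is forward).
obstacles = [Obstacle(x=0.0, y=12.0, radius=1.0)]

# Two candidate futures: keep the lane, or shift up to 3.5 m to the left.
keep_lane = [(0.0, 3.0 * k) for k in range(1, 7)]
change_left = [(-min(3.5, 1.2 * k), 3.0 * k) for k in range(1, 7)]

for name, trajectory in [("keep lane", keep_lane), ("change left", change_left)]:
    conflict = first_conflict(trajectory, obstacles)
    outcome = "no conflict" if conflict is None else f"conflict at waypoint {conflict[0]}"
    print(f"{name}: {outcome}")
# prints "keep lane: conflict at waypoint 3"
# prints "change left: no conflict"
```

Each simulated outcome could then be phrased as a question-answer pair (for example, what happens if the ego vehicle keeps its lane), which is the kind of consequence-aware supervision the counterfactual setting is described as providing.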

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

A Language Agent for Autonomous Driving

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Making Large Language Models Better Planners with Reasoning-Decision Alignment

DriveLM: Driving with Graph Visual Question Answering

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models

Probing Multimodal LLMs as World Models for Driving

VLP: Vision Language Planning for Autonomous Driving

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

Planning-oriented Autonomous Driving