OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
2024-05-03
Abstract: Advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset that challenges the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two major challenges in applying LLM-based agents to autonomous driving:

1. **3D Spatial Understanding and Planning**: Existing multimodal large language models (MLLMs) reason well over 2D images, but planning requires full 3D situational awareness, which 2D reasoning alone does not provide. To close this gap, the paper introduces the OmniDrive framework, which consists of a new model architecture, OmniDrive-Agent, and a new benchmark, OmniDrive-nuScenes.
   - **OmniDrive-Agent**: A new 3D multimodal LLM based on the Q-Former architecture. It uses sparse queries to lift and compress visual representations into 3D before feeding them into the LLM. This design lets the model jointly encode dynamic objects and static map elements (such as traffic lanes), providing a condensed world model for perception-action alignment.
   - **OmniDrive-nuScenes**: A comprehensive visual question-answering (VQA) benchmark covering scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Notably, its counterfactual reasoning setting simulates alternative decisions and trajectories to infer their potential consequences, which supports the model's understanding and planning in complex 3D scenes.
2. **Processing Multi-view High-resolution Video Inputs**: Traditional 2D MLLM architectures (like LLaVA-1.5) are limited by image encoder resolution and LLM token sequence length, which makes them difficult to apply directly to autonomous driving, where the input is multi-view, high-resolution video. OmniDrive-Agent addresses this by using the Q-Former architecture to compress visual information into a small set of sparse queries, which keeps the input to the LLM computationally manageable.
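The compression idea described above can be illustrated with a minimal sketch. This is not the OmniDrive-Agent implementation: it is a generic single-head cross-attention in plain Python, with learned projections, 3D position encoding, and temporal modelling all omitted. The function names (`cross_attend`, `random_vectors`) and all sizes are hypothetical and chosen only for illustration. The point it shows is that a fixed set of queries attending over many multi-view image tokens yields a fixed, much smaller number of tokens for the LLM.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head cross-attention without learned projections.

    queries: list of Q vectors, tokens: list of N vectors (same length D).
    Each query is replaced by an attention-weighted mean of the tokens,
    so the output always has Q vectors regardless of N.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        compressed.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return compressed


def random_vectors(count, dim, rng):
    """Hypothetical stand-in for encoder features or learned query embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Illustrative sizes only (assumptions, not values from the paper).
NUM_VIEWS = 6          # surround-camera views
TOKENS_PER_VIEW = 40   # image tokens per view
NUM_QUERIES = 8        # sparse queries passed on to the LLM
DIM = 16               # feature dimension

rng = random.Random(0)
image_tokens = random_vectors(NUM_VIEWS * TOKENS_PER_VIEW, DIM, rng)
sparse_queries = random_vectors(NUM_QUERIES, DIM, rng)

llm_inputs = cross_attend(sparse_queries, image_tokens)
print(len(image_tokens), "image tokens ->", len(llm_inputs), "query tokens")
# prints: 240 image tokens -> 8 query tokens
```

In this sketch the LLM sequence length depends only on `NUM_QUERIES`, so adding camera views or raising image resolution increases the cost of the attention step but not the number of tokens the LLM has to process.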
In summary, OmniDrive aims to provide a holistic framework for end-to-end autonomous driving that pairs a 3D-aware agent model for reasoning and planning with a more challenging evaluation benchmark that goes beyond single expert trajectories.