Abstract: The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information, which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, which is inspired by the driving logical progression of humans. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., around 9% improvement on various tasks. We plan to release our NuInstruct for future research development.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses two main issues in current language-based autonomous driving research:

1. **Task Partiality**: Existing benchmark datasets cover only a portion of the tasks involved in autonomous driving. In practice, autonomous driving is composed of a series of interdependent tasks, and the absence of any one of them can affect the overall functionality of the system. For example, without accurate perception it is difficult to make reliable predictions.
2. **Information Incompleteness**: The information that existing methods use to perform these tasks is often incomplete. Specifically, existing datasets typically contain only single-view images and do not consider temporal or multi-view information. Safe driving decisions, however, require a comprehensive understanding of the environment; for example, focusing only on the front view may miss a vehicle overtaking from the left.

To address these issues, the authors propose NuInstruct, a new dataset containing 91K multi-view video-instruction-response pairs covering 17 sub-tasks. NuInstruct provides more complex tasks than existing benchmark datasets, requiring extensive information such as multi-view, temporal, and distance cues. To generate the NuInstruct dataset, the authors introduce an SQL-based method that automatically creates instruction-response pairs.

Additionally, to tackle the challenging tasks posed by NuInstruct, the authors extend existing Multimodal Large Language Models (MLLMs) to receive more comprehensive information. They propose the BEV-InMLLM model, which integrates instruction-aware Bird's-Eye-View (BEV) features with existing MLLMs to enhance perception and decision-making in autonomous driving. BEV-InMLLM obtains BEV features aligned with language features through a plug-in BEV injection module. This approach is more efficient than training a BEV extractor from scratch.
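To make the idea of SQL-based instruction-response generation more concrete, the following is a minimal sketch using Python's standard `sqlite3` module. It is not the authors' pipeline: the `objects` table, its columns, the sample rows, and the helper names `build_demo_db` and `closest_object_qa` are all hypothetical, chosen only to illustrate how a query over multi-view annotations could be turned into a question-answer pair.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per annotated object per frame and camera view.
    conn.execute(
        "CREATE TABLE objects ("
        "frame_id INTEGER, camera_view TEXT, category TEXT, distance_m REAL)"
    )
    conn.executemany(
        "INSERT INTO objects VALUES (?, ?, ?, ?)",
        [
            (0, "front", "car", 12.5),
            (0, "front_left", "car", 6.0),
            (0, "back", "pedestrian", 20.0),
            (1, "front", "truck", 9.0),
        ],
    )
    return conn


def closest_object_qa(conn, frame_id):
    """Build one instruction-response pair from a SQL query (illustrative only)."""
    row = conn.execute(
        "SELECT category, camera_view, distance_m FROM objects "
        "WHERE frame_id = ? ORDER BY distance_m ASC LIMIT 1",
        (frame_id,),
    ).fetchone()
    if row is None:
        # No annotated objects for this frame, so no pair is generated.
        return None
    category, view, distance = row
    return {
        "instruction": "Which object is closest to the ego vehicle, "
                       "and in which camera view does it appear?",
        "response": f"The closest object is a {category} in the {view} view, "
                    f"about {distance:.1f} m away.",
    }


if __name__ == "__main__":
    conn = build_demo_db()
    print(closest_object_qa(conn, frame_id=0))
```

In this toy data, the closest object in frame 0 appears in the front-left camera rather than the front one, which mirrors the summary's point that a single front view can miss relevant vehicles.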

Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

Navigation Instruction Generation with BEV Perception and Large Language Models

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization

A Survey on Multimodal Large Language Models for Autonomous Driving

Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

Probing Multimodal LLMs as World Models for Driving

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

Instruct Large Language Models to Drive like Humans

Hierarchical Interpretable Imitation Learning for End-to-End Autonomous Driving