Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

Xinpeng Ding, Jinahua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li
2024-01-02
Abstract: The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, which is inspired by the driving logical progression of humans. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., around 9% improvement on various tasks. We plan to release our NuInstruct for future research development.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address two main issues in current language-based autonomous driving research:

1. **Task Partiality**: Existing benchmark datasets cover only a portion of the tasks involved in autonomous driving. In practice, autonomous driving is composed of a series of interdependent tasks, and the absence of any one of them can affect the overall functionality of the system. For example, without accurate perception it is difficult to make reliable predictions.
2. **Information Incompleteness**: The information that existing methods use to perform these tasks is often incomplete. Specifically, existing datasets typically contain only single-view images and do not consider temporal or multi-view information. Safe driving decisions, however, require a comprehensive understanding of the environment; for example, focusing only on the front view may overlook a vehicle overtaking from the left.

To address these issues, the authors propose NuInstruct, a new dataset containing 91K multi-view video-instruction-response pairs covering 17 sub-tasks. NuInstruct provides more complex tasks than existing benchmark datasets, requiring holistic information such as multi-view, temporal, and distance cues. To generate the NuInstruct dataset, the authors introduce an SQL-based method that creates instruction-response pairs automatically.

Additionally, to tackle the challenging tasks posed by NuInstruct, the authors extend existing Multimodal Large Language Models (MLLMs) to receive more comprehensive information. They propose the BEV-InMLLM model, which integrates instruction-aware Bird's-Eye-View (BEV) features with existing MLLMs to enhance perception and decision-making in autonomous driving. BEV-InMLLM obtains BEV features aligned with language features through a plug-in BEV injection module, an approach that is more efficient than training a BEV extractor from scratch.
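
The idea of generating instruction-response pairs with SQL can be illustrated with a small sketch. This is not the authors' pipeline: the `objects` table, its columns, the helper names `build_demo_db` and `distance_qa`, and the question template are all hypothetical, chosen only to show how a query over scene annotations might be turned into a question and its answer. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import math
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table.

    Coordinates (x, y) are assumed to be in meters in the ego-vehicle frame.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "frame_id INTEGER, object_id TEXT, category TEXT, x REAL, y REAL)"
    )
    rows = [
        (0, "car_1", "car", 3.0, 4.0),
        (0, "car_2", "car", 30.0, 40.0),
        (0, "ped_1", "pedestrian", 6.0, 8.0),
        (1, "car_1", "car", 0.0, 2.0),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def distance_qa(conn, frame_id, category):
    """Build one instruction-response pair from a SQL query.

    Returns None when no object of the requested category is in the frame.
    """
    # Order by squared distance so the nearest object comes first.
    row = conn.execute(
        "SELECT object_id, x, y FROM objects "
        "WHERE frame_id = ? AND category = ? "
        "ORDER BY x * x + y * y LIMIT 1",
        (frame_id, category),
    ).fetchone()
    if row is None:
        return None
    _object_id, x, y = row
    distance = math.hypot(x, y)
    return {
        "instruction": (
            f"How far is the nearest {category} from the ego vehicle "
            f"in frame {frame_id}?"
        ),
        "response": f"{distance:.1f} meters",
    }


conn = build_demo_db()
print(distance_qa(conn, 0, "car"))
```

In this sketch, each sub-task would correspond to its own query and template, so new question types could be added without manual labeling, under the assumption that the required annotations are already stored in the database.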