Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM can serve as a promising solution for future end-to-end autonomous driving applications in the real world.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "DriveMM: All-in-One Large Multimodal Model for Autonomous Driving" aims to solve two main problems with large multimodal models (LMMs) in the current autonomous driving field:

1. **Limitations of single-dataset, task-specific training**:
   - Current data-driven autonomous driving methods often focus on a single dataset and specific tasks, ignoring the model's overall capabilities and its ability to generalize. For example, some specialized models (such as CODA-LM and MAPLM) perform well on specific tasks but fall short when dealing with complex and diverse real-world scenarios.
   - Such limitations lead to poor performance when the model faces new datasets or unseen tasks.
2. **Lack of comprehensive evaluation benchmarks**:
   - There is currently no comprehensive benchmark for fully evaluating the performance of autonomous driving LMMs. Existing evaluations are usually limited to specific datasets and tasks and cannot fully reflect the true capabilities of the model.

To solve these problems, the paper proposes DriveMM, a general-purpose large multimodal model designed to handle multiple data inputs (such as images and multi-view videos) and perform a wide range of autonomous driving tasks (including perception, prediction, and planning). Through curriculum-based pre-training followed by fine-tuning, DriveMM performs well on multiple public benchmarks and also shows strong generalization ability in zero-shot learning tasks.

### Main contributions

1. **Proposed a new general-purpose large multimodal model**:
   - DriveMM can handle various autonomous driving tasks and generalize to new datasets.
2. **Introduced a comprehensive evaluation benchmark**:
   - It includes six public datasets, four input types, and thirteen challenging tasks. This is the first time that multiple benchmarks have been used to evaluate autonomous driving LMMs.
3. **Adopted curriculum learning for pre-training and fine-tuning**:
   - By gradually increasing the complexity of data and tasks, DriveMM consistently outperforms models trained on a single dataset across all evaluation benchmarks.

### Summary

By proposing DriveMM, the paper addresses the limitations of existing autonomous driving LMMs in handling complex and diverse tasks, and it provides a comprehensive evaluation benchmark to verify the model's generalization ability and overall performance. This provides strong support for future end-to-end autonomous driving applications.
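The curriculum idea mentioned in the contributions (training on progressively more complex data and tasks) can be illustrated with a minimal scheduling sketch. Everything here is an assumption made for illustration: the stage names, the `complexity` ordering key, the dataset labels, and the `train_fn` callback are hypothetical placeholders, not the stages or data DriveMM actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One curriculum stage; every field is a hypothetical placeholder."""
    name: str
    complexity: int  # illustrative ordering key, not taken from the paper
    datasets: tuple


def build_curriculum(stages):
    """Return the stages ordered from simplest to most complex."""
    return sorted(stages, key=lambda s: s.complexity)


def run_curriculum(stages, train_fn):
    """Call train_fn once per stage, easiest first, and return the visit order."""
    visited = []
    for stage in build_curriculum(stages):
        train_fn(stage)  # stand-in for a real training loop
        visited.append(stage.name)
    return visited


# Hypothetical stages, deliberately listed out of order.
STAGES = [
    Stage("driving_finetune", 3, ("ad_dataset_a", "ad_dataset_b")),
    Stage("single_image_pretrain", 1, ("generic_image_text",)),
    Stage("video_pretrain", 2, ("generic_video_text",)),
]

if __name__ == "__main__":
    order = run_curriculum(STAGES, train_fn=lambda s: print("training on", s.datasets))
    print(order)
```

The sketch only captures the ordering aspect of a curriculum: simpler visual comprehension stages run before driving-specific fine-tuning. How the real model weights data within a stage or carries parameters between stages is not described in the text above.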

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

A Survey on Multimodal Large Language Models for Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Probing Multimodal LLMs as World Models for Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

EMMA: End-to-End Multimodal Model for Autonomous Driving

OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving

ADriver-I: A General World Model for Autonomous Driving

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs

Application of Multimodal Large Language Models in Autonomous Driving