Abstract: We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.

What problem does this paper attempt to address?

The paper addresses multi-task processing and end-to-end motion planning for autonomous driving. It introduces EMMA (End-to-End Multimodal Model for Autonomous Driving), which uses a unified multimodal language model framework to generate driving-related outputs, such as planned trajectories, perceived objects, and road graph elements, directly from raw camera sensor data. The core objectives of EMMA are:

1. **Improve the generality and adaptability of autonomous driving systems**: By representing all non-sensor inputs (such as navigation instructions and the ego vehicle state) and outputs (such as trajectories and 3D positions) as natural language text, EMMA can jointly process multiple driving tasks in a unified language space and generate the output for each task using task-specific prompts.
2. **Overcome the limitations of traditional modular systems**: Traditional autonomous driving systems usually adopt a modular design with predefined interfaces between modules, which can lead to errors accumulating across modules and to limited cross-module communication. Through end-to-end learning, EMMA eliminates the need for symbolic interfaces and allows driving objectives to be optimized jointly from the raw sensor inputs.
3. **Leverage the advantages of large-scale pre-trained language models**: EMMA is built on large-scale pre-trained multimodal language models (such as Gemini), which have been trained on Internet-scale datasets and possess rich world knowledge and strong reasoning abilities. This is intended to help EMMA understand and handle complex real-world scenarios.
4. **Achieve multi-task joint training**: The paper shows that co-training EMMA on motion planning, object detection, and road graph tasks improves performance on all three, highlighting EMMA's potential as a generalist model for autonomous driving.

However, EMMA also has limitations, including:

- **Limited 3D spatial reasoning ability**: EMMA currently cannot fuse camera inputs with LiDAR or radar data, which limits its 3D spatial reasoning.
- **High computational cost**: EMMA has relatively high computational requirements and requires expensive sensor simulation to support closed-loop evaluation.
- **Limited image frame processing ability**: EMMA can process only a small number of image frames.

Overall, the goal of EMMA is to improve the performance, generality, and adaptability of autonomous driving systems through a unified multimodal model framework while overcoming the limitations of traditional modular systems.
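To make the "everything as text" idea above concrete, here is a minimal Python sketch of how non-sensor inputs and trajectory outputs could be written as plain text and read back. The prompt wording, the coordinate format, and the helper names (`waypoints_to_text`, `text_to_waypoints`, `build_planning_prompt`) are all assumptions made for illustration; the summary does not specify EMMA's actual prompt or output format, and the camera inputs are omitted here.

```python
import re
from typing import List, Tuple

# Hypothetical representation: a waypoint is an (x, y) position in meters.
Waypoint = Tuple[float, float]

# Matches "(x, y)" pairs of plain decimal numbers, e.g. "(1.00, -0.50)".
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def waypoints_to_text(waypoints: List[Waypoint], decimals: int = 2) -> str:
    """Serialize waypoints as text, e.g. '(1.00, 0.50), (2.25, -0.75)'."""
    return ", ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in waypoints)


def text_to_waypoints(text: str) -> List[Waypoint]:
    """Parse every '(x, y)' pair found in a text string back into floats."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


def build_planning_prompt(command: str, past: List[Waypoint], speed_mps: float) -> str:
    """Assemble an illustrative task-specific prompt for motion planning.

    The field names and layout are assumptions, not the format used in the paper.
    """
    return (
        "Task: plan the future trajectory.\n"
        f"Navigation command: {command}\n"
        f"Ego speed (m/s): {speed_mps:.1f}\n"
        f"Past waypoints: {waypoints_to_text(past)}\n"
        "Future waypoints:"
    )
```

In this sketch a different task (for example object detection) would change only the prompt text, while the same text-in, text-out interface stays in place; a model's textual reply to the planning prompt could then be passed through `text_to_waypoints` to recover numeric waypoints.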

EMMA: End-to-End Multimodal Model for Autonomous Driving

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

Probing Multimodal LLMs as World Models for Driving

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

A Survey on Multimodal Large Language Models for Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models

MMFN: Multi-Modal-Fusion-Net for End-to-End Driving

CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

ADriver-I: A General World Model for Autonomous Driving

Multimodal End-to-End Autonomous Driving

EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Multi-Modal Sensor Fusion-Based Deep Neural Network for End-to-End Autonomous Driving With Scene Understanding

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model