Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang

2024-10-30

Abstract: End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods that use LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well suited to precise numerical prediction. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction: Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM uses a multi-image encoding approach and multi-view prompts for efficient scene understanding. In addition, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on the large-scale dataset DriveX and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% compared with the model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.
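
The 27.12% and 33.33% figures are relative reductions against the model without pre-training. As a minimal sketch of that arithmetic, the helper below computes a percentage reduction from a baseline value and an improved value. The function name and the input numbers are illustrative assumptions, not values reported in the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Hypothetical inputs chosen only to illustrate the formula; NOT from the paper.
print(round(relative_reduction(3.0, 2.0), 2))  # 33.33
```

A metric that falls from 3 units to 2 units has dropped by one third, which is the same proportion as the collision-rate reduction quoted in the abstract.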

Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

End-to-end autonomous driving models gain strong planning capabilities from large-scale data, but they still perform poorly in complex and rare scenarios because they lack commonsense. Large vision-language models (LVLMs), in contrast, excel at scene understanding and reasoning. Using an LVLM directly to predict trajectories or control signals is not ideal, however, because LVLMs are poorly suited to precise numerical prediction.

To address these issues, the paper proposes Senna, a system that combines a large vision-language model (Senna-VLM) with an end-to-end model (Senna-E2E). Senna-VLM generates high-level planning decisions in natural language, and Senna-E2E predicts the specific trajectory based on those decisions. Through this structured planning approach, Senna aims to improve the safety, robustness, and generalization of autonomous driving.

The paper explores three key questions:

1. How to integrate LVLMs with end-to-end models?
2. How to design an LVLM suitable for driving tasks?
3. How to effectively train a driving LVLM?

Experiments on two datasets show that Senna achieves state-of-the-art planning performance and has strong cross-scenario generalization and transfer capabilities.
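
The decoupling described above (a language model produces a decision in natural language, and a separate planner turns that decision into a trajectory) can be sketched as follows. This is a toy illustration under stated assumptions, not Senna's implementation: the decision vocabulary, the `parse_decision` and `toy_planner` helpers, the kinematic constants, and the stubbed VLM are all hypothetical, and the source does not say how Senna-E2E actually consumes the decision.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical decision vocabulary (assumption; not taken from the paper).
LATERAL = ("left", "right", "straight")
SPEED = ("accelerate", "decelerate", "keep", "stop")


@dataclass(frozen=True)
class Decision:
    lateral: str
    speed: str


def parse_decision(text: str) -> Decision:
    """Map free-form text to a discrete decision, with a conservative fallback."""
    words = re.findall(r"[a-z]+", text.lower())
    lateral = next((w for w in words if w in LATERAL), "straight")
    speed = next((w for w in words if w in SPEED), "keep")
    return Decision(lateral, speed)


def toy_planner(decision: Decision, steps: int = 6, dt: float = 0.5) -> List[Tuple[float, float]]:
    """Stand-in for an end-to-end model: returns (x, y) waypoints for a decision."""
    speed = 5.0  # initial speed in m/s (hypothetical)
    accel = {
        "accelerate": 1.0,
        "decelerate": -1.0,
        "keep": 0.0,
        "stop": -speed / (steps * dt),  # brake to rest over the horizon
    }[decision.speed]
    lateral_rate = {"left": 0.5, "right": -0.5, "straight": 0.0}[decision.lateral]
    x = y = 0.0
    waypoints = []
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)
        x += speed * dt          # forward progress
        y += lateral_rate * dt   # lateral drift; positive means left
        waypoints.append((x, y))
    return waypoints


def plan(vlm: Callable[[str], str], prompt: str) -> List[Tuple[float, float]]:
    """High-level decision from the language model, low-level trajectory from the planner."""
    return toy_planner(parse_decision(vlm(prompt)))


# Stubbed VLM for demonstration; a real system would query a vision-language model.
fake_vlm = lambda prompt: "Turn left and decelerate."
print(plan(fake_vlm, "Describe the driving decision for the current scene."))
```

Keeping the language model's output discrete and symbolic means the numerical work stays in the planner, which is the division of labour the summary attributes to Senna-VLM and Senna-E2E.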

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

VLP: Vision Language Planning for Autonomous Driving

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

HE-Drive: Human-Like End-to-End Driving with Vision Language Models

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Probabilistic End-to-End Vehicle Navigation in Complex Dynamic Environments with Multimodal Sensor Fusion

Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

Empowering Autonomous Driving with Large Language Models: A Safety Perspective

Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes

End-to-End Autonomous Driving With Semantic Depth Cloud Mapping and Multi-Agent

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

End-to-End Learning of Driving Models with Surround-View Cameras and Route Planners

Multi-Modal Sensor Fusion-Based Deep Neural Network for End-to-End Autonomous Driving With Scene Understanding