Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang
2024-10-30
Abstract: End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods that use LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction: Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM uses a multi-image encoding approach and multi-view prompts for efficient scene understanding. In addition, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on the large-scale dataset DriveX and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% compared with the same model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
End-to-end autonomous driving models gain strong planning capabilities from large-scale data, but they still perform poorly in complex and rare scenarios because they lack commonsense. Large vision-language models (LVLMs), in contrast, excel at scene understanding and reasoning. Using an LVLM directly to predict trajectories or control signals is not ideal, however, because LVLMs are poor at precise numerical prediction. To address this, the paper proposes Senna, a system that combines a large vision-language model (Senna-VLM) with an end-to-end model (Senna-E2E). Senna-VLM produces high-level planning decisions in natural language, and Senna-E2E predicts the concrete trajectory based on those decisions. With this structured planning approach, Senna aims to improve the safety, robustness, and generalization of autonomous driving. The paper explores three key questions:
1. How should LVLMs be integrated with end-to-end models?
2. How should an LVLM be designed for driving tasks?
3. How can a driving LVLM be trained effectively?
Experiments on two datasets show that Senna achieves state-of-the-art planning performance and has strong cross-scenario generalization and transfer capabilities.
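The decoupling described above, a language-level decision followed by a numerical trajectory, can be illustrated with a small sketch. This is not Senna's implementation: the decision vocabulary, the `parse_decision` and `plan_trajectory` functions, and the acceleration and yaw-rate constants are all hypothetical, and the trajectory stage is a toy kinematic rollout standing in for a learned end-to-end model.

```python
import math
from dataclasses import dataclass

# Hypothetical decision vocabulary (an assumption, not taken from the paper).
LATERAL = ("left", "straight", "right")
SPEED = ("accelerate", "keep", "decelerate", "stop")


@dataclass(frozen=True)
class Decision:
    lateral: str
    speed: str


def parse_decision(text: str) -> Decision:
    """Map a natural-language decision to discrete labels (stage 1 output)."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    lateral = next((w for w in words if w in LATERAL), "straight")
    speed = next((w for w in words if w in SPEED), "keep")
    return Decision(lateral, speed)


def plan_trajectory(decision: Decision, v0: float, steps: int = 6, dt: float = 0.5):
    """Toy stand-in for the trajectory stage: roll out waypoints in the ego frame.

    The constants below are illustrative placeholders, not tuned values.
    """
    accel = {"accelerate": 1.0, "keep": 0.0, "decelerate": -1.0, "stop": -3.0}[decision.speed]
    yaw_rate = {"left": 0.1, "straight": 0.0, "right": -0.1}[decision.lateral]
    x = y = heading = 0.0
    v = v0
    waypoints = []
    for _ in range(steps):
        v = max(0.0, v + accel * dt)      # speed never goes negative
        heading += yaw_rate * dt          # positive yaw turns left
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints


decision = parse_decision("Decelerate and turn left.")
trajectory = plan_trajectory(decision, v0=5.0)
```

The point of the split is that the language model only has to choose among a few discrete options, while all numerical precision is left to the second stage.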