CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto

2024-08-19

Abstract: Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose the CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses a key challenge in autonomous driving: handling diverse and unpredictable driving environments. Specifically, it proposes the **CoVLA (Comprehensive Vision-Language-Action) dataset** to overcome the limitations of existing datasets in scale and in annotation richness, especially language descriptions. By combining visual data, language descriptions, and driving actions, the CoVLA dataset supports the development of more capable autonomous driving systems.

#### Main Contributions

1. **Proposing the CoVLA dataset**: A large-scale dataset of 10,000 real driving scene videos (over 80 hours in total), each with precise driving paths and detailed natural language descriptions.
2. **Scalable approach**: Automated data processing and caption generation pipelines that estimate trajectories accurately through sensor fusion and generate frame-level text captions automatically.
3. **Developing the CoVLA-Agent model**: A VLA-based model capable of end-to-end path planning in various driving scenarios, generating consistent and accurate driving scene descriptions and predicted trajectories.

### Summary

The main goal of the paper is to advance autonomous driving research by introducing the CoVLA dataset, with a focus on improving how autonomous driving systems perform in complex and unpredictable environments. By combining the vision, language, and action modalities, the CoVLA dataset provides a resource for training and evaluating more reliable and intelligent autonomous driving systems.
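To make the pairing of vision, language, and action concrete, here is a minimal Python sketch of what a single frame-level sample might look like. It is an illustration only: the class `VLAFrame`, its field names, the file path, the caption text, and the waypoint values are all hypothetical and are not taken from the released dataset's schema. The helper `mean_clip_seconds` only turns the stated figures (10,000 videos, over 80 hours) into a lower bound on the average clip length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VLAFrame:
    """One hypothetical frame-level sample (not the dataset's real schema)."""
    image_path: str                        # vision: reference to a camera frame
    caption: str                           # language: frame-level description
    trajectory: List[Tuple[float, float]]  # action: future (x, y) waypoints, metres


def path_length(trajectory: List[Tuple[float, float]]) -> float:
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])
    )


def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip duration in seconds for a given total duration."""
    return total_hours * 3600.0 / num_clips


# All values below are made up for illustration.
sample = VLAFrame(
    image_path="clip_0001/frame_000.png",
    caption="The ego vehicle is moving straight at a moderate speed.",
    trajectory=[(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)],
)

print(path_length(sample.trajectory))   # length of the example path in metres
print(mean_clip_seconds(80, 10_000))    # lower bound, since the total exceeds 80 hours
```

Keeping the three modalities in one record is only one possible layout; a real loader might store captions and trajectories in separate files keyed by clip and frame index.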
