CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto
2024-08-19
Abstract: Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses a key challenge in autonomous driving: handling diverse and unpredictable driving environments. Specifically, it proposes the **CoVLA (Comprehensive Vision-Language-Action) dataset** to overcome the limitations of existing datasets in scale and annotation coverage, especially language descriptions. By combining visual data, language descriptions, and driving actions, the CoVLA dataset supports the development of more capable autonomous driving systems.

#### Main Contributions

1. **Proposing the CoVLA dataset**: A large-scale dataset containing 10,000 real driving scene videos (over 80 hours), each with precise driving trajectories and detailed natural language descriptions.
2. **Scalable approach**: An automated data processing and caption generation pipeline that estimates trajectories accurately through sensor fusion and generates frame-level text captions automatically.
3. **Developing the CoVLA-Agent model**: A VLA model capable of end-to-end path planning in various driving scenarios, generating consistent and accurate driving scene descriptions together with predicted trajectories.

### Summary

The main goal of the paper is to advance autonomous driving research by introducing the CoVLA dataset, with a focus on improving the performance of autonomous driving systems in complex and unpredictable environments. By combining the vision, language, and action modalities, the CoVLA dataset provides a resource for training and evaluating more reliable autonomous driving systems.
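
To make the pairing of vision, language, and action more concrete, the sketch below shows one way a single frame-level sample could be represented. It is illustrative only: the class name `VLASample`, its field names, the file path, the caption text, and the waypoint coordinates are all hypothetical and are not taken from the CoVLA release, whose actual schema may differ.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

# Assumption: a waypoint is (x, y, z) in metres in some vehicle-centred frame.
Waypoint = Tuple[float, float, float]


@dataclass
class VLASample:
    """Hypothetical record pairing the three modalities for one video frame."""
    frame_path: str             # vision: path to the camera frame
    caption: str                # language: description of scene and maneuver
    trajectory: List[Waypoint]  # action: future waypoints of the ego vehicle


def path_length(trajectory: List[Waypoint]) -> float:
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


# Made-up example values, for illustration only.
sample = VLASample(
    frame_path="scene_0001/frame_000.png",
    caption="The ego vehicle is moving straight at a moderate speed.",
    trajectory=[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)],
)

print(f"{sample.caption} Planned path: {path_length(sample.trajectory):.1f} m")
```

Keeping the caption and the trajectory on the same record is what would let a VLA model be supervised on language and action outputs for the same frame, which is the kind of joint training the summary describes.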