Abstract: Autonomous driving systems struggle with complex scenarios because they have limited access to diverse, extensive, and out-of-distribution driving data, which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576$\times$1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (approximately 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html

What problem does this paper attempt to address?

This paper addresses the challenges that autonomous driving systems face in complex scenarios. Specifically:

1. **Low spatio-temporal resolution**: Existing driving world models usually operate within a short time window (< 30 frames), so the generated video sequences are only a few seconds long. This limits the model's ability to encode and process features over a long time range. Some models that attempt long video generation sacrifice resolution to reduce computational cost, and as a result cannot accurately represent complex driving environments.
2. **Accumulation of autoregressive errors**: Existing driving world models make long-term predictions by iteratively predicting short segments and resetting the conditioning images. Small inaccuracies in each prediction are amplified over successive iterations, ultimately causing significant drift in long sequences and reducing the accuracy and consistency of the generated videos.
3. **Lack of diversity**: Given the same initial input and random noise, current models generate almost identical videos, which limits their ability to produce diverse driving scenarios.

To solve these problems, the paper proposes InfinityDrive, a new driving world model with the following features:

- **Efficient spatio-temporal co-modeling module**: By dynamically adjusting the information density, it prioritizes spatial detail at high resolution and strengthens temporal modeling at low resolution, ensuring high fidelity and consistency in long video generation.
- **Extended temporal training strategy**: Curriculum learning gradually expands the time window, enabling the model to predict further into the future and to model long-term dependencies and behavior trends effectively.
- **Memory injection and retention mechanism**: Combined with the adaptive memory curve loss, it limits cumulative errors, maintains consistency with the conditioning inputs, and achieves high-quality video generation of more than 1500 frames (about 2 minutes).
- **Joint image-to-video (I2V) and text-to-video (T2V) training**: The greater variability of text data enhances the model's diversity, enabling it to adapt to a wide range of conditions.

Through these innovations, InfinityDrive achieves state-of-the-art performance in generating high-fidelity, diverse, and temporally consistent minute-scale videos.
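To make the link between the time window and autoregressive error accumulation concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical curriculum that doubles the training window at each stage, and a hypothetical rollout in which each new segment reuses a few conditioning frames from the previous one. The function names, the doubling rule, the 128-frame window, and the overlap values are all illustrative assumptions; only the "< 30 frames" window and the 1500-frame target come from the text above.

```python
import math


def window_schedule(start_frames: int, max_frames: int) -> list[int]:
    """Hypothetical curriculum: double the training window until max_frames.

    The doubling rule is an assumption for illustration; the summary only
    states that the time window is expanded gradually.
    """
    windows = []
    w = start_frames
    while w < max_frames:
        windows.append(w)
        w *= 2
    windows.append(max_frames)
    return windows


def rollout_steps(total_frames: int, window: int, overlap: int) -> int:
    """Count the autoregressive segments needed to reach total_frames.

    Each segment produces `window` frames, of which `overlap` are
    conditioning frames carried over from the previous segment, so every
    segment after the first contributes `window - overlap` new frames.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if total_frames <= window:
        return 1
    new_per_step = window - overlap
    return 1 + math.ceil((total_frames - window) / new_per_step)


if __name__ == "__main__":
    # Hypothetical curriculum from an 8-frame to a 128-frame window.
    print(window_schedule(8, 128))

    # Segments needed for 1500 frames with a short window (under 30 frames)
    # versus a longer, hypothetical 128-frame window.
    print(rollout_steps(1500, window=25, overlap=1))
    print(rollout_steps(1500, window=128, overlap=8))
```

Under these assumptions, a 25-frame window needs 63 segments to cover 1500 frames, while a 128-frame window needs 13. Each segment boundary is a point where prediction error can feed into the next segment, which is one way to see why a longer window helps and why the remaining boundaries still call for a mechanism such as memory injection and retention.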

InfinityDrive: Breaking Time Limits in Driving World Models

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

GAIA-1: A Generative World Model for Autonomous Driving

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

ADriver-I: A General World Model for Autonomous Driving

Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model

Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model

DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration