InfinityDrive: Breaking Time Limits in Driving World Models

Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weixuan Tang, Wei Wu
2024-12-02
Abstract: Autonomous driving systems struggle with complex scenarios due to limited access to diverse, extensive, and out-of-distribution driving data, which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576$\times$1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (approximately 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges that autonomous driving systems face when dealing with complex scenarios. Specifically, it identifies three problems in existing driving world models:

1. **Low spatio-temporal resolution**: Existing driving world models usually operate within a short time window (fewer than 30 frames), so the generated video sequences are only a few seconds long. This limits the model's ability to encode and process features over a long time range. Some models that attempt long-term video generation sacrifice resolution to reduce computational cost, and as a result cannot accurately represent complex driving environments.
2. **Accumulation of autoregressive errors**: Existing driving world models make long-term predictions by iteratively predicting short segments and resetting the conditioning images. Small inaccuracies in each prediction are amplified over time, which leads to significant drift in long sequences and reduces the accuracy and consistency of the generated videos.
3. **Lack of diversity**: Given the same initial input and random noise, current models generate almost identical videos, which limits their ability to produce diverse driving scenarios.

To solve these problems, the paper proposes InfinityDrive, a new driving world model with the following features:

- **Efficient spatio-temporal co-modeling module**: By dynamically adjusting the information density, it prioritizes spatial detail at high resolution and strengthens temporal modeling at low resolution, ensuring high fidelity and consistency in long video generation.
- **Extended temporal training strategy**: Curriculum learning gradually expands the time window, enabling the model to predict further into the future and to model long-term dependencies and behavior trends effectively.
- **Memory injection and retention mechanisms**: Combined with an adaptive memory curve loss, they limit cumulative errors, maintain consistency with the conditioning inputs, and achieve high-quality video generation of more than 1500 frames (about 2 minutes).
- **Joint image-to-video (I2V) and text-to-video (T2V) training**: The greater variability of text data enhances the model's diversity, enabling it to adapt to a wide range of conditions.

Through these innovations, InfinityDrive achieves state-of-the-art performance in generating high-fidelity, diverse, and temporally consistent minute-scale videos.
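
The error-accumulation problem described above can be illustrated with a toy simulation. The sketch below is not InfinityDrive or any real world model: it replaces the video model with a 1-D random walk, and the names `rollout` and `mean_abs_drift`, the chunk length of 25 frames, and the per-frame noise level are all assumptions chosen for illustration. It only shows why re-conditioning each short segment on the model's own last output lets small per-frame errors grow over a 1500-frame sequence.

```python
import random


def rollout(num_frames, chunk_len, step_noise, seed=0):
    """Toy chunked autoregressive rollout on a 1-D 'scene state'.

    The true state is constant at 0.0, so each returned value is also the
    prediction error at that frame. Every chunk starts from the last
    predicted frame of the previous chunk, so errors cross chunk boundaries.
    """
    rng = random.Random(seed)
    frames = []
    cond = 0.0  # conditioning frame; the first one is ground truth
    while len(frames) < num_frames:
        state = cond
        for _ in range(min(chunk_len, num_frames - len(frames))):
            state += rng.gauss(0.0, step_noise)  # small per-frame error
            frames.append(state)
        cond = frames[-1]  # re-condition on the model's own output
    return frames


def mean_abs_drift(frame_index, trials=200, **kwargs):
    """Average absolute error at one frame index over several seeds."""
    total = 0.0
    for seed in range(trials):
        total += abs(rollout(seed=seed, **kwargs)[frame_index])
    return total / trials


if __name__ == "__main__":
    # Hypothetical settings: 25-frame chunks, 1500 frames in total.
    cfg = dict(num_frames=1500, chunk_len=25, step_noise=0.01)
    print("drift after   25 frames:", mean_abs_drift(24, **cfg))
    print("drift after 1500 frames:", mean_abs_drift(1499, **cfg))
```

In this toy setting the average drift at the last frame is much larger than at the end of the first chunk, because nothing pulls later chunks back toward the original conditions. The memory mechanisms and loss described in the paper are aimed at this kind of drift; the sketch does not attempt to reproduce them.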