GameGen-X: Interactive Open-world Game Video Generation

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen
2024-11-02
Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, comprising over a million diverse gameplay video clips sampled from over 150 games, with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. First, the model is pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the problem of generating and controlling high-quality, complex open-world game video content. Specifically, the authors propose GameGen-X, the first diffusion transformer model capable of generating and simulating open-world video games with interactive control. The main objectives include:

1. **Generating high-quality game content**: Creating open-world game videos that include dynamic environments, diverse characters, engaging events, and complex actions.
2. **Achieving interactive control**: Allowing users to influence the generated content through text instructions and keyboard inputs, making the generated videos responsive to user operations and simulating a real interactive gaming experience.

To achieve these goals, the authors conducted the following work:

- **Dataset construction**: Created OGameData, the first large-scale dataset for open-world game video generation and control, containing over 1 million video clips from more than 150 next-generation games.
- **Two-stage training**:
  - **Base model pre-training**: Pre-trained on the OGameData-GEN dataset for text-to-video generation and video continuation tasks, enabling the model to generate high-quality game content.
  - **Instruction fine-tuning**: Designed InstructNet to adjust the generated content through multimodal control signals (such as text instructions and keyboard inputs) to achieve interactive control.

Through these methods, GameGen-X has made significant progress in generating and controlling open-world game videos, demonstrating the potential of generative models in game content design and development.
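
The two-stage recipe described above (pre-train a foundation model, then freeze it and update only a control branch) can be illustrated with a deliberately tiny sketch. This is not the paper's architecture or training code: the scalar "model", the names `ToyParams`, `predict`, `sgd_step`, and `train_two_stage`, and the targets are all hypothetical stand-ins chosen only to show the freezing pattern.

```python
from dataclasses import dataclass


@dataclass
class ToyParams:
    base_w: float = 0.0      # hypothetical stand-in for the foundation model weights
    instruct_w: float = 0.0  # hypothetical stand-in for the InstructNet weights


def predict(p, x, control):
    # The control branch adds a correction on top of the base prediction
    # and contributes nothing when control == 0.
    return p.base_w * x + p.instruct_w * control * x


def sgd_step(p, x, control, target, lr, trainable):
    # Gradient of the squared error (pred - target) ** 2 for each parameter.
    err = predict(p, x, control) - target
    grads = {"base_w": 2 * err * x, "instruct_w": 2 * err * control * x}
    for name in trainable:  # parameters not listed here stay frozen
        setattr(p, name, getattr(p, name) - lr * grads[name])


def train_two_stage(steps=200, lr=0.05):
    p = ToyParams()
    # Stage 1 (pre-training): no control signal, only the base weight is updated.
    for _ in range(steps):
        sgd_step(p, x=1.0, control=0.0, target=2.0, lr=lr, trainable=("base_w",))
    base_after_stage1 = p.base_w
    # Stage 2 (instruction tuning): the base weight is frozen and only the
    # control branch learns to account for the control signal.
    for _ in range(steps):
        sgd_step(p, x=1.0, control=1.0, target=5.0, lr=lr, trainable=("instruct_w",))
    return p, base_after_stage1


if __name__ == "__main__":
    params, base_before = train_two_stage()
    print("base weight unchanged in stage 2:", params.base_w == base_before)
    print("uncontrolled output:", predict(params, 1.0, 0.0))
    print("controlled output:", predict(params, 1.0, 1.0))
```

Because the base weight is never touched in stage 2, the uncontrolled output is identical before and after instruction tuning in this toy. That mirrors, in a very simplified form, the stated motivation for freezing the foundation model: adding controllability without degrading what pre-training already learned.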