From Sora What We Can See: A Survey of Text-to-Video Generation

Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan
2024-05-17
Abstract: With impressive achievements made, artificial intelligence is on the path toward artificial general intelligence. Sora, developed by OpenAI, is capable of minute-level world simulation and can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we start from the perspective of disassembling Sora in text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question, *From Sora What We Can See*. Specifically, after the basic preliminaries on the general algorithms are introduced, the literature is categorized along three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Finally, and most importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper surveys the field of Text-to-Video (T2V) generation, using an analysis of OpenAI's Sora as the lens for reviewing the field's current state and future directions. It focuses on the following aspects:

1. **Technological Progress**: Sora has achieved notable results in T2V generation and can produce high-quality, minute-level videos, but it still faces open challenges. The paper categorizes existing methods along three dimensions: the evolution of generators, the pursuit of higher quality, and the techniques used to achieve realistic scenes.
2. **Algorithm Classification**: Current T2V generation algorithms are classified into three categories: GAN/VAE-based, diffusion-based, and autoregressive-based. Each category has its own advantages and limitations.
3. **Datasets and Evaluation Metrics**: The paper describes in detail the datasets and evaluation metrics used in T2V research, such as PSNR/SSIM, IS, and FID, so that researchers can better understand and compare the performance of different models.
4. **Challenges and Future Directions**: The paper identifies the main challenges in the T2V field, including coherent dynamic motion, complex scene generation, multi-object handling, and reasonable layout generation. It also proposes future research and development directions to address them.

By reviewing Sora and its related technologies, the paper aims to give T2V researchers a systematic view of the field and to support further technical progress.
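As an illustration of the simplest metric named above, the following is a minimal sketch of PSNR using its standard definition, 10 · log10(MAX² / MSE). The function names `psnr` and `video_psnr`, the flat-list frame representation, and the choice to average per-frame scores over a clip are assumptions made for this example; they are not taken from the survey, which may use a different evaluation protocol.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two frames given as flat sequences of pixel values."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def video_psnr(reference_frames, generated_frames, max_value=255.0):
    """Mean per-frame PSNR over a clip (averaging is an assumption of this sketch)."""
    if len(reference_frames) != len(generated_frames) or len(reference_frames) == 0:
        raise ValueError("videos must be non-empty and have the same number of frames")
    scores = [psnr(r, g, max_value) for r, g in zip(reference_frames, generated_frames)]
    return sum(scores) / len(scores)
```

Higher PSNR means the generated frame is closer to the reference at the pixel level. Because it needs a ground-truth reference, PSNR is applicable only when a target video exists; distribution-level metrics such as IS and FID compare feature statistics instead and require a pretrained network, so they are not sketched here.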