From Sora What We Can See: A Survey of Text-to-Video Generation

Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan

2024-05-17

Abstract: With impressive achievements made, artificial intelligence is on the path toward artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world simulation, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we start from the perspective of disassembling Sora in text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question "From Sora What We Can See". Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized along three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but more importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses several key issues in Text-to-Video (T2V) generation and examines the current state and future directions of the field through an in-depth analysis of Sora, the system developed by OpenAI. Specifically, it focuses on the following aspects:

1. **Technological Progress**: Although Sora has achieved significant results in T2V generation and can produce high-quality, minute-level videos, it still faces challenges. The paper categorizes existing methods and discusses the evolution of generators, the pursuit of higher quality, and the technical means of achieving realistic scenes.
2. **Algorithm Classification**: Current T2V generation algorithms are classified into three categories: GAN/VAE-based, diffusion-based, and autoregressive-based. Each category has its own advantages and limitations.
3. **Datasets and Evaluation Metrics**: The paper gives a detailed introduction to the datasets and evaluation metrics used in T2V research, such as PSNR/SSIM, IS, and FID, helping researchers understand and compare the performance of different models.
4. **Challenges and Future Directions**: The paper identifies the main challenges in the T2V field, including the coherence of dynamic motion, the generation of complex scenes, multi-object handling, and reasonable layout generation. It also proposes future research and development directions to overcome these difficulties.

By comprehensively reviewing and analyzing Sora and its related technologies, the paper aims to give researchers in the T2V field a systematic perspective and to promote further technological advances in this area.
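As an illustration of the simplest metric named above, PSNR compares a generated frame with a reference frame through their mean squared error. The following is a minimal sketch, not the survey's own evaluation code. It assumes frames are given as flat sequences of pixel intensities with a known maximum value (255 for 8-bit images), and the function names `psnr` and `video_psnr` are hypothetical. Averaging per-frame scores is one common convention for video, assumed here for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two equally sized frames.

    Frames are flat sequences of pixel intensities (an assumption made
    for this sketch); max_value is the largest possible intensity.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def video_psnr(ref_frames: Sequence[Sequence[float]],
               gen_frames: Sequence[Sequence[float]],
               max_value: float = 255.0) -> float:
    """Average of per-frame PSNR values over a clip (assumed convention)."""
    if len(ref_frames) != len(gen_frames) or len(ref_frames) == 0:
        raise ValueError("clips must be non-empty and of equal length")
    scores = [psnr(r, g, max_value) for r, g in zip(ref_frames, gen_frames)]
    return sum(scores) / len(scores)
```

Higher PSNR means the generated frame is closer to the reference pixel by pixel. Because it needs a ground-truth frame, it suits reconstruction or prediction settings more than open-ended text-conditioned generation, where distribution-level metrics such as IS and FID are typically used instead.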

Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

Text-to-video generative artificial intelligence: sora in neurosurgery

Sora OpenAI's Prelude: Social Media Perspectives on Sora OpenAI and the Future of AI Video Generation

What Matters in Detecting AI-Generated Videos like Sora?

Analysing the Public Discourse around OpenAI's Text-To-Video Model 'Sora' using Topic Modeling

When Does Sora Show: The Beginning of TAO to Imaginative Intelligence and Scenarios Engineering

A Survey On Text-to-3D Contents Generation In The Wild

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

"Sora is Incredible and Scary": Emerging Governance Challenges of Text-to-Video Generative AI Models

From text to video with AI: the rise and potential of Sora in education and libraries

A Survey of AI Text-to-Image and AI Text-to-Video Generators

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Sora Generates Videos with Stunning Geometrical Consistency

A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights

Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era