A Survey of AI Text-to-Image and AI Text-to-Video Generators

Aditi Singh
DOI: https://doi.org/10.1109/AIRC57904.2023.10303174
2023-11-11
Abstract: Text-to-Image and Text-to-Video AI generation models are revolutionary technologies that use deep learning and natural language processing (NLP) techniques to create images and videos from textual descriptions. This paper investigates cutting-edge approaches in the discipline of Text-to-Image and Text-to-Video AI generations. The survey provides an overview of the existing literature as well as an analysis of the approaches used in various studies. It covers data preprocessing techniques, neural network types, and evaluation metrics used in the field. In addition, the paper discusses the challenges and limitations of Text-to-Image and Text-to-Video AI generations, as well as future research directions. Overall, these models have promising potential for a wide range of applications such as video production, content creation, and digital marketing.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
This paper surveys the key problems and technical challenges in text-to-image (T2I) and text-to-video (T2V) generation. Specifically, it focuses on the following aspects:

1. **Technical Review**: The paper reviews current T2I and T2V generation models, covering data preprocessing techniques, types of neural networks, and evaluation metrics. These techniques are the basis for high-quality image and video generation.
2. **Model Performance**: The paper analyzes the performance of different models, including T2I generators such as CogView2, DALL-E 2, and Imagen, and T2V generators such as Make-A-Video, Imagen Video, Phenaki, GODIVA, and CogVideo. These models differ in image quality and video coherence.
3. **Challenges and Limitations**:
   - **Data Set**: Obtaining and annotating high-quality training data is a major challenge.
   - **Interpretability**: The generated content is hard to interpret, and it is difficult to understand the logic behind the generated visual content.
   - **Computational Resources**: Generating high-resolution images and videos requires a large amount of computational resources, which limits practical applications.
   - **Social Norms**: The generated content may not conform to social or public norms, leading to misunderstandings or inappropriate representations.
4. **Future Research Directions**: The paper discusses future research directions, including improving generation efficiency, enhancing the generalization ability of models, and reducing computational costs, to make these techniques more practical and widely applicable.

Overall, the paper gives researchers in T2I and T2V generation a clear picture of the current state of the field and its direction of development through a comprehensive technical review and in-depth analysis.