Text-to-Image Cross-Modal Generation: A Systematic Review

Maciej Żelaszczyk, Jacek Mańdziuk
2024-01-22
Abstract: We review research on generating visual data from text from the angle of "cross-modal generation." This point of view allows us to draw parallels between various methods geared towards working on input text and producing visual output, without limiting the analysis to narrow sub-areas. It also results in the identification of common templates in the field, which are then compared and contrasted both within pools of similar methods and across lines of research. We provide a breakdown of text-to-image generation into various flavors of image-from-text methods, video-from-text methods, image editing, self-supervised and graph-based approaches. In this discussion, we focus on research papers published at 8 leading machine learning conferences in the years 2016-2022, also incorporating a number of relevant papers not matching the outlined search criteria. The conducted review suggests a significant increase in the number of papers published in the area and highlights research gaps and potential lines of investigation. To our knowledge, this is the first review to systematically look at text-to-image generation from the perspective of "cross-modal generation."
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper sets out to provide a systematic review and analysis of the field of text-to-image cross-modal generation. Specifically, the authors review research on text-to-image generation from the perspective of cross-modal generation. This perspective lets them draw connections between different methods and identify common templates in the field, which they then compare both within pools of similar methods and across lines of research. By focusing on papers published at 8 leading machine learning conferences between 2016 and 2022, the authors also point out research gaps and potential research directions in the field.

### Main contributions of the paper:

1. **Systematic review**: According to the authors, this is the first review to systematically examine text-to-image generation from the perspective of cross-modal generation.
2. **Identification of common templates**: The paper identifies and compares common templates in text-to-image generation, including methods based on Variational Auto-Encoders (VAE), Generative Adversarial Networks (GAN) and diffusion models.
3. **Analysis of research trends**: By analyzing relevant research from recent years, the paper shows how the field has grown and where it is heading.
4. **Identification of research gaps**: The paper points out gaps in current research and possible directions for future work.

### Specific content of the paper:

- **Introduction section**: Introduces the progress of deep learning in image classification and natural language processing, as well as the development of generative models, especially the application of VAEs, GANs and diffusion models to image generation.
- **Cross-modal generation**: Discusses how data of one modality can be used to generate data of another modality, such as generating images from text.
- **Text-to-image generation**: Describes in detail various methods of text-to-image generation, including methods based on VAEs, GANs and diffusion models, and provides specific architectures and training processes.
- **Evaluation metrics**: Introduces several quantitative metrics for evaluating the quality of generated images, such as Inception Score, R-precision, L2 error and Fréchet Inception Distance (FID).
- **Specific methods**:
  - **VAE template**: Describes how RNN encoders process text inputs and how images are then generated through variational auto-encoders.
  - **GAN template**: Introduces the working principles of generators and discriminators, and how adversarial training is used to generate high-quality images.
  - **Diffusion model template**: Explains how images are generated through the forward diffusion process and the reverse denoising process, and shows specific training loss functions.

### Conclusion:

Through systematic review and analysis, the paper gives researchers a comprehensive view of the current state of text-to-image generation and points out potential directions for future research, which should help the field develop further.
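
As an illustration of the diffusion model template summarized above, the following is a minimal sketch of the forward diffusion step and a noise-prediction training loss. It is not taken from the paper: it assumes the commonly used DDPM-style formulation, in which a noisy sample is drawn in closed form as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε and the network is trained to predict ε. The linear beta schedule, the four-value toy "image", and the `toy_denoiser` function (a stand-in for a trained, text-conditioned network) are all assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the true noise."""
    squared = [(p - e) ** 2 for p, e in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


def toy_denoiser(x_t, t, text_embedding):
    """Hypothetical stand-in for a trained network: always predicts zero noise."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [0.5, -0.25, 1.0, 0.0]        # flattened toy "image" (assumption)
text_embedding = [0.1, 0.2, 0.3]   # placeholder for an encoded caption (assumption)
t = rng.randrange(len(betas))      # training draws a random timestep

x_t, eps = forward_diffuse(x0, t, alpha_bars, rng)
loss = noise_prediction_loss(toy_denoiser(x_t, t, text_embedding), eps)
print(f"t={t}, loss={loss:.4f}")
```

In a real text-to-image system the denoiser would be a neural network that actually uses the text embedding, and the reverse denoising process would apply it repeatedly, starting from pure noise, to produce an image.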