Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Robin Zbinden

DOI: https://doi.org/10.48550/arXiv.2209.10948

2022-09-22

Abstract: Taking advantage of the many recent advances in deep learning, text-to-image generative models are currently attracting the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models enable the production of many different types of high-resolution images, where human imagination is the only limit. However, these models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models, making the reproduction of their results complicated, if not impossible. In this thesis, we aim to contribute by first reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model. Largely based on DALL-E 2, our implementation introduces several slight modifications to tackle the high computational cost induced. We thus have the opportunity to experiment in order to understand what these models are capable of, especially in a low-resource regime. In particular, we provide additional and deeper analyses than the ones performed by the authors of DALL-E 2, including ablation studies. Besides, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method which can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The main problem this paper attempts to solve is that current state-of-the-art text-to-image generation models (such as DALL-E 2 and Imagen) require very large computational resources, and their codebases and models have not been publicly released, making it difficult for the AI community to reproduce their results. Specifically:

1. **High computational cost**: State-of-the-art text-to-image generation models need to process large-scale datasets collected from the internet, and training is very time-consuming and requires a large amount of computational resources.
2. **Lack of reproducibility**: Since the code and pre-trained models are not public, it is difficult for researchers to verify and improve these models.

To solve these problems, the author sets the following goals:

- **Review existing methods and techniques**: Analyze in detail the methods and techniques used by current state-of-the-art text-to-image generation models.
- **Propose an independent implementation**: Based on DALL-E 2, the author introduces some minor modifications to reduce the computational cost, so that the model can be trained and experimented with in a lower-resource environment.
- **Provide more in-depth analysis**: Carry out more detailed experimental analysis, including ablation studies, to better understand the capabilities and limitations of the model.
- **Introduce a new guidance method**: Propose a new image-guidance method that can be combined with other guidance methods to improve the quality of the generated images.

Through these efforts, the author hopes to enable more researchers to experiment with and improve text-to-image generation models and to promote the further development of this field.

Emage: Non-Autoregressive Text-to-Image Generation

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

DiffusionGPT: LLM-Driven Text-to-Image Generation System

A Survey of Data-Driven 2D Diffusion Models for Generating Images from Text

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

Text-to-image Diffusion Models in Generative AI: A Survey

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

Are Diffusion Models Vision-And-Language Reasoners?

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction

GlyphDiffusion: Text Generation as Image Generation

Evaluating Text-to-Image Diffusion Models for Texturing Synthetic Data

CustomText: Customized Textual Image Generation using Diffusion Models

Text to Image Conversion using Stable Diffusion

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models