Abstract: Taking advantage of the many recent advances in deep learning, text-to-image generative models are currently attracting the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models enable the production of many different types of high-resolution images, where human imagination is the only limit. However, these models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models, making the reproduction of their results complicated, if not impossible. In this thesis, we aim to contribute by first reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model. Our implementation is closely based on DALL-E 2, with several slight modifications to reduce the high computational cost. This gives us the opportunity to experiment in order to understand what these models are capable of, especially in a low-resource regime. In particular, we provide additional and deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Furthermore, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to incur the significant training costs of state-of-the-art text-to-image models.

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Emage: Non-Autoregressive Text-to-Image Generation

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Customization Assistant for Text-to-image Generation

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

InstructBooth: Instruction-following Personalized Text-to-Image Generation

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights

Class-Conditional self-reward mechanism for improved Text-to-Image models

Reward Incremental Learning in Text-to-Image Generation

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Personalized Image Generation with Large Multimodal Models

Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation

Referee Can Play: an Alternative Approach to Conditional Generation Via Model Inversion

Personalized and Sequential Text-to-Image Generation

Conditional Text-to-Image Generation with Reference Guidance

Fast Personalized Text to Image Synthesis with Attention Injection

Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models