Abstract: Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses user intent alignment and semantic control in text-to-image generation models. Although existing text-to-image models can produce high-quality images from textual descriptions, the generated images often do not fully align with the user's initial intent, and small changes to the input prompt can lead to significant variations in the output. This leaves users without fine-grained semantic control during generation. The paper therefore proposes **Semantic Guidance (SEGA)**, which lets users control the generated images more precisely by flexibly steering the diffusion process along semantic directions.

### Main Contributions

1. **Definition of Semantic Guidance**: The paper formally defines the concept of semantic guidance and discusses the numerical intuition of the corresponding semantic space.
2. **Robustness, Uniqueness, Monotonicity, and Isolation**: It demonstrates the robustness, uniqueness, monotonicity, and isolation of semantic vectors.
3. **Extensive Empirical Evaluation**: The paper provides a thorough empirical evaluation of SEGA's semantic control across various tasks, providing evidence of its versatility, flexibility, and improvements over existing methods.
4. **Comparison with Related Methods**: Through user preference surveys and direct comparisons, the paper shows the advantages of SEGA over other methods.

### Solution

SEGA achieves the above goals through the following means:

- **No Additional Training or Architectural Extensions**: SEGA can be applied directly to existing diffusion models without additional training or architectural extensions.
- **Semantic Control Based on Noise Estimation**: Vectors representing specific semantic concepts are extracted by calculating the difference between conditional and unconditional noise estimates.
- **Multi-Concept Combination**: Multiple semantic concepts can be applied simultaneously, with each concept's vector affecting only specific parts of the image without interfering with the others.
- **Parameter Adjustment for Fine-Grained Control**: By adjusting parameters such as the guidance scale and warm-up period, users can intuitively control the details of the generated images.

### Experimental Results

The paper validates the effectiveness of SEGA through extensive experiments:

- **Facial Attribute Editing**: Experiments on the CelebA dataset successfully edited 10 facial attributes, including glasses, smile, baldness, and beard.
- **Simultaneous Multi-Concept Editing**: SEGA demonstrated the ability to apply multiple semantic concepts simultaneously without interference.
- **Image Quality Improvement**: Evaluation of the edited images with FID scores found that SEGA not only achieves precise semantic control but also significantly improves the quality of the generated images.

In summary, SEGA provides an efficient, flexible, and intuitive method for users to achieve finer semantic control in the text-to-image generation process.
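As an illustration of the noise-estimate arithmetic described under Solution, the following is a minimal sketch of a single guidance step. It is not the authors' implementation. It assumes the three noise estimates (unconditional, prompt-conditioned, and concept-conditioned) have already been produced by some diffusion model and are available as NumPy arrays. The function name `semantic_guidance_step` and all parameter names and default values are hypothetical and chosen only for this example. The sketch also omits details a full method may use, such as momentum.

```python
import numpy as np


def semantic_guidance_step(eps_uncond, eps_cond, eps_concept, step,
                           guidance_scale=7.5, edit_scale=5.0,
                           tail_quantile=0.95, warmup_steps=10,
                           reverse=False):
    """Combine noise estimates for one denoising step (illustrative only).

    All argument names and defaults are assumptions made for this sketch.
    """
    # Standard classifier-free guidance toward the text prompt.
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # During the warm-up period no semantic edit is applied.
    if step < warmup_steps:
        return eps

    # Semantic direction: concept-conditioned minus unconditional estimate.
    direction = eps_concept - eps_uncond
    if reverse:
        # Steer away from the concept instead of toward it.
        direction = -direction

    # Keep only the largest-magnitude entries so the edit stays localized.
    threshold = np.quantile(np.abs(direction), tail_quantile)
    mask = np.abs(direction) >= threshold

    return eps + edit_scale * np.where(mask, direction, 0.0)
```

To combine several concepts in this sketch, one could compute a masked direction per concept and sum the resulting terms. Because each mask keeps only the strongest entries of its own direction, the edits tend to touch different regions of the estimate.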

SEGA: Instructing Text-to-Image Models using Semantic Guidance

Emage: Non-Autoregressive Text-to-Image Generation

Semantic Guidance Tuning for Text-To-Image Diffusion Models

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Segmentation-Free Guidance for Text-to-Image Diffusion Models

LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance

SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis

Self-Guidance: Boosting Flow and Diffusion Generation on Their Own

SpaText: Spatio-Textual Representation for Controllable Image Generation

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Improving Diffusion Models for Scene Text Editing with Dual Encoders

Diffusion Self-Guidance for Controllable Image Generation

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

Scribble-Guided Diffusion for Training-free Text-to-Image Generation

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing