InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

Xiefan Guo,Jinlin Liu,Miaomiao Cui,Jiankai Li,Hongyu Yang,Di Huang

2024-04-06

Abstract: Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the misalignment between generated images and their text prompts in text-to-image synthesis. Even state-of-the-art text-to-image diffusion models trained on large-scale text-image datasets still struggle to generate images that precisely follow a given prompt. The main failure modes named in the paper are subject neglect, subject mixing, and incorrect attribute binding, which the authors attribute mainly to invalid initial noise. To address them, the authors propose Initial Noise Optimization (InitNO), which improves image-prompt alignment by optimizing the initial noise. InitNO evaluates the initial noise with two metrics, the cross-attention response score and the self-attention conflict score, and uses them to divide the initial latent space into valid and invalid regions. A carefully designed noise optimization pipeline then guides the initial noise toward the valid region, improving the consistency between the generated images and the text prompts.
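The scoring idea can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: the two score formulas, the function names, and the thresholds `tau_response` and `tau_conflict` are all hypothetical choices made for illustration, and the attention maps are hand-written 2D lists rather than maps extracted from a diffusion model. The step that actually optimizes the noise is not shown.

```python
from itertools import combinations


def _flatten(attn_map):
    """Flatten a 2D attention map (list of rows) into a list of floats."""
    return [value for row in attn_map for value in row]


def cross_attention_response_score(cross_maps):
    """Assumed form: 1 minus the weakest per-subject peak response.

    cross_maps holds one 2D map per subject token, with values in [0, 1].
    A low score means every subject token responds strongly somewhere,
    which this sketch treats as a proxy for "no subject neglected".
    """
    peaks = [max(_flatten(m)) for m in cross_maps]
    return 1.0 - min(peaks)


def self_attention_conflict_score(subject_maps):
    """Assumed form: mean pairwise spatial overlap between subject maps.

    Returns 0.0 when the subjects occupy disjoint regions and 0.5 when
    two maps are identical, a proxy for "subjects mixed together".
    """
    flat = [_flatten(m) for m in subject_maps]
    ratios = []
    for a, b in combinations(flat, 2):
        overlap = sum(min(x, y) for x, y in zip(a, b))
        mass = sum(a) + sum(b)
        ratios.append(overlap / mass if mass > 0 else 0.0)
    return sum(ratios) / len(ratios) if ratios else 0.0


def is_valid_noise(cross_maps, subject_maps,
                   tau_response=0.2, tau_conflict=0.3):
    """Hypothetical validity test: both scores must fall below a threshold."""
    return (cross_attention_response_score(cross_maps) < tau_response
            and self_attention_conflict_score(subject_maps) < tau_conflict)


# Two subjects that respond strongly in separate corners of a 2x2 map.
cat = [[1.0, 0.0], [0.0, 0.0]]
dog = [[0.0, 0.0], [0.0, 1.0]]
good = [cat, dog]
print(is_valid_noise(good, good))
```

In this toy setup, a noise sample whose maps fail the check would be a candidate for refinement or resampling; how InitNO performs that refinement is described in the paper itself.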


Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation

Saliency Guided Optimization of Diffusion Latents

FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models

The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization

Golden Noise for Diffusion Models: A Learning Framework

Spatial-Aware Latent Initialization for Controllable Image Generation

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Dynamic Prompt Optimizing for Text-to-Image Generation

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

Model-Agnostic Human Preference Inversion in Diffusion Models

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis

MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models