InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang
2024-04-06
Abstract: Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the misalignment between generated images and their text prompts in text-to-image synthesis. Even state-of-the-art text-to-image diffusion models, trained on large-scale text-image datasets, still struggle to generate images that precisely follow a given prompt. The main failure modes identified in the paper are subject neglect, subject mixing, and incorrect attribute binding, which the authors attribute mainly to invalid initial noise. To address them, the authors propose Initial Noise Optimization (InitNO), which improves the alignment between generated images and text prompts by optimizing the initial noise. InitNO evaluates the initial noise with two measures, the cross-attention response score and the self-attention conflict score, and uses them to divide the initial latent space into valid and invalid regions. A noise optimization pipeline then guides the initial noise towards the valid region, improving the consistency between the generated images and the text prompts.
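The idea of scoring an initial noise sample and moving it towards a valid region can be illustrated with a toy sketch. This is not the authors' implementation: `toy_score` is a hypothetical quadratic stand-in for the attention-based scores (which in the paper are computed from the diffusion model's attention maps), and `target`, `threshold`, `lr`, and `max_steps` are assumed values chosen only for illustration. The sketch only shows the general loop: evaluate the noise, stop if it is already valid, otherwise take a gradient step.

```python
import random


def toy_score(noise, target):
    """Hypothetical stand-in for the attention-based scores; lower is better."""
    return sum((n - t) ** 2 for n, t in zip(noise, target)) / len(noise)


def toy_score_grad(noise, target):
    """Analytic gradient of toy_score with respect to each noise component."""
    return [2.0 * (n - t) / len(noise) for n, t in zip(noise, target)]


def optimize_initial_noise(noise, target, threshold=0.05, lr=0.5, max_steps=200):
    """Move the noise by gradient descent until its score falls below threshold.

    Returns (noise, steps_taken, is_valid). A noise sample is treated as
    "valid" here when its toy score is below the assumed threshold.
    """
    noise = list(noise)
    for step in range(max_steps):
        if toy_score(noise, target) < threshold:
            return noise, step, True
        grad = toy_score_grad(noise, target)
        noise = [n - lr * g for n, g in zip(noise, grad)]
    return noise, max_steps, toy_score(noise, target) < threshold


rng = random.Random(0)          # fixed seed so the run is repeatable
dim = 16                        # toy latent size, far smaller than a real latent
target = [0.5] * dim            # arbitrary centre of the toy "valid region"
initial = [rng.gauss(0.0, 1.0) for _ in range(dim)]

optimized, steps, valid = optimize_initial_noise(initial, target)
print("initial score:", toy_score(initial, target))
print("final score:", toy_score(optimized, target), "valid:", valid, "steps:", steps)
```

In a real pipeline the score would depend on the text prompt and on the model's attention maps, and the gradient would be obtained by automatic differentiation rather than a closed-form expression.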