Abstract: Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A recent approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo depends heavily on the starting point, as it tends to converge to a local optimum near that point. To address this, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, which consistently enhances semantic alignment across various diffusion models. The code is available at https://github.com/Bomingmiao/NoiseDiffusion.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the lack of semantic consistency between the generated image and the input prompt in text-to-image synthesis. Although diffusion models have achieved remarkable success in generating realistic images, they still struggle to ensure precise semantic alignment between the generated image and the input prompt. Specifically, existing methods such as InitNo improve semantic alignment by optimizing the initial noisy latent, but they tend to converge to a local optimum near the starting point and rely on limited information (such as attention maps). This paper therefore proposes a new framework that uses the semantic understanding ability of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces a noise diffusion process that generates more semantically faithful images while maintaining distribution consistency.

### Main contributions of the paper:

1. **A new framework**: Use the semantic understanding ability of LVLMs to supervise the diffusion generation process, and introduce the noise diffusion method to optimize the initial noisy latent while preserving its distribution, thereby enhancing the semantic fidelity of the generated image to the input prompt.
2. **Theoretical analysis**: Analyze the condition under which updating the latent increases the VQA score, and, based on this analysis, propose a strategy that uses gradient information to select the noise.
3. **Experimental verification**: Extensive experiments show that the method can be applied seamlessly to various diffusion models and effectively improves their semantic fidelity.

### Key technologies:

- **VQA score**: Use a large vision-language model to measure how well the generated image aligns with the input prompt.
- **Noise diffusion**: Update the initial noisy latent by gradually adding Gaussian noise, moving it toward a region that is more conducive to generating semantically consistent images.
- **Dynamic step-size adjustment**: Adjust the step size according to the VQA score, so that the step is larger when the score is low and smaller when the score is high.
- **Gradient-based noise selection**: Randomly sample a set of noises in each iteration and use gradient information to select the most suitable one for the update.

### Experimental results:

- **Qualitative comparison**: In both simple and complex cases, the generated images show higher semantic fidelity, especially for prompts with complex semantic relationships.
- **Quantitative comparison**: Results on two datasets show that both the VQA score and the CLIP score improve significantly as the number of optimization rounds increases, and that after 50 rounds the method outperforms InitNo.
- **Comparison of different optimization techniques**: Compared with other optimization methods (PGD, Mean-Variance, Random Sampling, Random Diffusion), the noise diffusion method generates images that closely match the prompt at an early stage and performs stably throughout the optimization process.

In conclusion, by introducing the noise diffusion method, this paper addresses the insufficient semantic alignment of existing image generation methods and provides a new solution for text-to-image synthesis.
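
To make the key technologies listed above more concrete, here is a minimal NumPy sketch of how one update step might combine them. It is an illustration under stated assumptions, not the authors' implementation. The function names (`dynamic_step`, `select_noise`, `noise_diffusion_step`), the parameters `max_step` and `num_candidates`, the linear step schedule, and the variance-preserving mixing rule are all hypothetical choices. `toy_score` is a stand-in for the VQA score, which in a real system would come from decoding the latent with a diffusion model and querying an LVLM.

```python
import numpy as np


def dynamic_step(score, max_step=0.2):
    """Assumed linear schedule: larger step when the score is low, smaller when high."""
    return max_step * (1.0 - float(np.clip(score, 0.0, 1.0)))


def select_noise(candidates, grad):
    """Return the candidate noise with the largest inner product with the score gradient."""
    flat = candidates.reshape(len(candidates), -1)
    return candidates[int(np.argmax(flat @ grad.ravel()))]


def noise_diffusion_step(z, score, grad, rng, num_candidates=16, max_step=0.2):
    """One hypothetical update: mix the latent with a selected Gaussian noise sample."""
    beta = dynamic_step(score, max_step)
    candidates = rng.standard_normal((num_candidates,) + z.shape)
    eps = select_noise(candidates, grad)
    # If z and eps are independent N(0, I) samples, this mix is again N(0, I).
    # Selecting eps by gradient alignment biases it slightly along one direction,
    # so the property holds only approximately in this sketch.
    return np.sqrt(1.0 - beta) * z + np.sqrt(beta) * eps


def toy_score(z, target):
    """Stand-in for the VQA score, with values in (0, 1]. Not an LVLM."""
    return float(np.exp(-0.5 * np.mean((z - target) ** 2)))


def toy_score_grad(z, target):
    """Analytic gradient of toy_score with respect to z."""
    return -toy_score(z, target) * (z - target) / z.size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(64)  # hypothetical "good" latent for the toy score
    z = rng.standard_normal(64)       # initial noisy latent
    for _ in range(50):
        s = toy_score(z, target)
        z = noise_diffusion_step(z, s, toy_score_grad(z, target), rng)
    print("final toy score:", toy_score(z, target))
```

Mixing with weights `sqrt(1 - beta)` and `sqrt(beta)` is one common way to add Gaussian noise without changing the variance of the latent. A plain gradient step, by contrast, can push the latent away from the standard normal distribution the diffusion model expects.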

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

Saliency Guided Optimization of Diffusion Latents

Golden Noise for Diffusion Models: A Learning Framework

Spatial-Aware Latent Initialization for Controllable Image Generation

Enhancing semantic mapping in text-to-image diffusion via Gather-and-Bind

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Unleashing Text-to-Image Diffusion Models for Visual Perception

From text to mask: Localizing entities using the attention of text-to-image diffusion models

MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Semantic Image Synthesis Via Diffusion Models

Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

Create Your World: Lifelong Text-to-Image Diffusion

Text-driven Visual Synthesis with Latent Diffusion Prior

The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation

Aligning Diffusion Models with Noise-Conditioned Perception