Golden Noise for Diffusion Models: A Learning Framework

Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, Zeke Xie
2024-11-14
Abstract: Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are "golden noises" that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the noise prompt learning framework that systematically learns "prompted" golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network (NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a golden noise instead of a random noise without accessing the original pipeline.
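The plug-and-play idea in the abstract (replace the random starting noise with the same noise plus a small text-dependent perturbation, and leave the sampler untouched) can be illustrated with a toy sketch. Everything named here is an assumption made for illustration: `toy_noise_prompt`, the `scale` value, and the fixed random projection are hypothetical stand-ins for a trained network and do not reproduce the authors' NPNet architecture or training.

```python
import random


def sample_gaussian_noise(size, rng):
    """Draw `size` independent standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_noise_prompt(noise, text_embedding, scale=0.05, seed=0):
    """Hypothetical stand-in for a learned noise prompt network.

    Each noise element receives a small perturbation computed from the
    text embedding through a fixed random projection. A trained network
    would learn this mapping from data; the projection here is arbitrary.
    """
    proj_rng = random.Random(seed)
    dim = len(text_embedding)
    prompted = []
    for value in noise:
        # One random projection row per noise element, normalized by sqrt(dim).
        direction = sum(proj_rng.gauss(0.0, 1.0) * t for t in text_embedding)
        prompted.append(value + scale * direction / dim ** 0.5)
    return prompted


rng = random.Random(42)
noise = sample_gaussian_noise(64, rng)           # flattened initial latent
text_embedding = sample_gaussian_noise(16, rng)  # placeholder prompt embedding
golden = toy_noise_prompt(noise, text_embedding)

# The sampler would now start from `golden` instead of `noise`;
# nothing else in the sampling pipeline changes.
```

The point of the sketch is the interface rather than the mapping: the module consumes a noise tensor and a prompt representation and returns a tensor of the same shape, which is why it can sit in front of an existing pipeline without modifying it.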
Machine Learning, Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses how the choice of random noise affects the quality of images generated by text-to-image (T2I) diffusion models. Specifically, it asks how to obtain "golden noise" with a machine learning method, so that generated images align better with the text prompt and have higher overall quality.

#### Background and problem description

1. **The influence of noise on generated images**:
   - Text-to-image diffusion models generate images from a text prompt and random Gaussian noise.
   - Even a slight change in the noise can significantly affect the quality of the final image.
   - Some specific noises (i.e., "golden noise") improve the semantic consistency and overall quality of the generated image.
2. **Limitations of existing methods**:
   - Existing methods can optimize noise, but they are hard to apply widely for the following reasons:
     - They do not generalize well to different datasets or diffusion models.
     - They require additional time to optimize the noise.
     - They require deep modifications to the original pipeline.
     - They need to compute a specific loss function for each prompt, which is impractical in real applications.

#### Main contributions of the paper

To overcome these problems, the paper makes the following three main contributions:

1. **Introducing the new concept of "noise prompt"**:
   - Adding a small perturbation derived from the text prompt to random Gaussian noise transforms it into "golden noise".
   - This golden noise is rich in semantic information and is tailored to the given text prompt.
2. **Constructing a noise-prompt learning framework**:
   - Proposing a noise-prompt dataset (NPD) collection pipeline for generating large-scale datasets of noise pairs.
   - Designing a small noise-prompt network (NPNet) that directly converts random noise into golden noise.
   - NPNet works as a plug-in module and improves the quality of generated images without changing the original inference pipeline.
3. **Extensive experimental verification**:
   - Conducting experiments on multiple mainstream diffusion models (such as StableDiffusion-xl, DreamShaper-xl-v2-turbo, and Hunyuan-DiT).
   - Evaluating the effectiveness and generalization ability of the model with six human-preference metrics (such as HPSv2, PickScore, and AES).
   - The results show that NPNet improves the overall quality and aesthetic style of the generated images and yields significant gains across the evaluation metrics.

#### Formula summary

- **Forward process of the diffusion model**: \[ x_t = \alpha_t x_0 + \sigma_t \epsilon_t \] where \(\epsilon_t \sim \mathcal{N}(0, I)\), \(t \in \{0, 1, \cdots, T\}\), and \(\alpha_t\) and \(\sigma_t\) are predefined noise-schedule parameters.
- **DDIM reverse (denoising) step**: \[ x_{t-1} = \text{DDIM}(x_t) = \alpha_{t-1}\left(\frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\alpha_t}\right) + \sigma_{t-1} \epsilon_\theta(x_t, t) \]
- **Classifier-free guidance (CFG)**: \[ \epsilon_{\text{pred}} = (\omega + 1)\, \epsilon_\theta(x_t, t \mid c) - \omega\, \epsilon_\theta(x_t, t \mid \emptyset) \]
- **Loss function of the noise-prompt learning task**: \[ \phi^*=\arg \min_\phi \m