Diffusion Models Generate Images Like Painters: an Analytical Theory of Outline First, Details Later

Binxu Wang, John J. Vastola
2024-03-26
Abstract: How do diffusion generative models convert pure noise into meaningful images? In a variety of pretrained diffusion models (including conditional latent space models like Stable Diffusion), we observe that the reverse diffusion process that underlies image generation has the following properties: (i) individual trajectories tend to be low-dimensional and resemble 2D 'rotations'; (ii) high-variance scene features like layout tend to emerge earlier, while low-variance details tend to emerge later; and (iii) early perturbations tend to have a greater impact on image content than later perturbations. To understand these phenomena, we derive and study a closed-form solution to the probability flow ODE for a Gaussian distribution, which shows that the reverse diffusion state rotates towards a gradually-specified target on the image manifold. It also shows that generation involves first committing to an outline, and then to finer and finer details. We find that this solution accurately describes the initial phase of image generation for pretrained models, and can in principle be used to make image generation more efficient by skipping reverse diffusion steps. Finally, we use our solution to characterize the image manifold in Stable Diffusion. Our viewpoint reveals an unexpected similarity between generation by GANs and diffusion and provides a conceptual link between diffusion and image retrieval.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper examines how diffusion generative models behave during image generation and proposes a theoretical framework that explains this behavior. It focuses on three observations made across a range of pretrained diffusion models:

1. **Generation Order**: Reverse diffusion follows an "outline first, details later" pattern. High-variance features such as the overall scene layout emerge early, and low-variance details are filled in afterwards.
2. **Trajectory Characteristics**: An individual generation trajectory tends to be low-dimensional and resembles a 2D rotation. As the state moves from pure noise to a meaningful image, its path can be approximated by rotational motion within a plane.
3. **Perturbation Impact**: Perturbations applied early in generation tend to change the final image content more than perturbations applied later.

To understand these phenomena, the authors derive a closed-form solution to the probability flow ordinary differential equation (ODE) for a Gaussian distribution and use it to show how the reverse diffusion state gradually approaches the target image. This solution accurately describes the initial phase of image generation in pretrained models and can in principle make generation more efficient by skipping reverse diffusion steps.

The paper also proposes a method, based on this analytical theory, to accelerate sampling from unconditional diffusion models: the Gaussian analytical solution is used to "teleport" past the early part of the trajectory, which reduces the number of neural network function evaluations required. In the reported experiments, this speeds up generation while maintaining the quality of the generated images.
Finally, through the analysis of sampling trajectories, the paper provides a method to characterize the image manifold, which helps to better understand the internal working mechanisms of diffusion models and the spatial structure of generated images.
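
The closed-form Gaussian solution mentioned above can be illustrated with a minimal sketch. It assumes a forward process of the form x_t = alpha_t * x_0 + sigma_t * noise and data distributed as N(mu, cov); this notation, the function name `gaussian_pf_ode_solution`, and the toy covariance are illustrative assumptions, not taken from the paper's code. Under these assumptions the probability flow ODE is linear, so along each eigenvector of the covariance the centered state is simply rescaled by the ratio of marginal standard deviations at the two times.

```python
import numpy as np

def gaussian_pf_ode_solution(x_T, mu, cov, alpha_t, sigma_t, alpha_T=0.0, sigma_T=1.0):
    """Map a state x_T at time T to time t under the probability flow ODE
    of a Gaussian N(mu, cov).

    Assumes the forward process x_t = alpha_t * x_0 + sigma_t * eps, so the
    marginal at time t is N(alpha_t * mu, alpha_t**2 * cov + sigma_t**2 * I).
    """
    # Eigen-decomposition of the data covariance: cov = U diag(lam) U^T.
    lam, U = np.linalg.eigh(cov)
    # Per-direction scale factor: ratio of marginal standard deviations.
    psi = np.sqrt((alpha_t**2 * lam + sigma_t**2) / (alpha_T**2 * lam + sigma_T**2))
    # Project the centered state onto the eigenbasis, rescale, and map back.
    coeffs = U.T @ (x_T - alpha_T * mu)
    return alpha_t * mu + U @ (psi * coeffs)

# Toy example (hypothetical numbers): one high-variance "layout" direction
# and one low-variance "detail" direction.
mu = np.zeros(2)
cov = np.diag([9.0, 0.01])
x_T = np.array([1.0, 1.0])          # initial noise sample
for alpha_t in (0.1, 0.5, 0.9):
    sigma_t = np.sqrt(1.0 - alpha_t**2)
    x_t = gaussian_pf_ode_solution(x_T, mu, cov, alpha_t, sigma_t)
    print(alpha_t, x_t)
```

In a "teleportation" setting, a function of this kind would stand in for the early sampler steps, with a learned denoiser taking over from the intermediate state; how the paper chooses the hand-off time is not specified in this summary.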