BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Qinchan, Kenneth Chen, Changyue, Qi Sun
2024-12-08
Abstract: Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the high computational requirements and energy consumption of diffusion models in text-to-image generation. Specifically, the authors observe that images generated from different text prompts need different amounts of computation, yet most existing optimization methods simplify the neural network or optimize the text prompt in a fixed manner, which can lead to inconsistent quality across generated images.

**Core problems**:

1. **Trade-off between computational efficiency and perceptual quality**: Not all denoising steps contribute equally to human-perceived quality. Different text prompts may require different computational effort to reach the desired content quality.
2. **Environmental and social impacts**: High computational requirements not only limit on-device deployment but also raise social concerns about energy consumption and environmental impact.

### Proposed solutions:

To address these problems, the authors propose **BudgetFusion**, a perception-guided adaptive diffusion model. Its main goal is to predict the optimal number of denoising steps for a given text prompt before generation starts, improving computational efficiency while preserving perceptual quality.

**Specific contributions**:

- **Perception-guided adaptive denoising step prediction**: By predicting how multi-scale perceptual metrics change with the number of denoising steps, the model determines the step count that maximizes the "per-step perceptual quality gain".
- **Large-scale dataset generation**: A large synthetic dataset is generated from more than 18,000 text prompts and 12 time steps for training and evaluating the model.
- **Multi-scale perceptual metrics**: Three levels of perceptual metrics (pixel level, mid-level layout, and high-level semantics) are used to evaluate the quality of generated images.
- **Experimental verification**: Numerical analysis and user studies show that BudgetFusion can save up to five seconds of generation time per prompt while maintaining perceptual similarity.

### Summary:

BudgetFusion optimizes the computational efficiency of diffusion models through a perception-guided method, targeting the wasted computation and inconsistent quality of existing fixed approaches. The work is expected to encourage further research on using computational resources more effectively in generative models while preserving human-perceived quality.
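The step-selection idea described under the contributions (choosing a step count from predicted perceptual-quality curves) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a predictor has already produced one quality score per candidate step count for each of the three metric levels, that scores are normalized so higher is better, and that "most perceptually efficient" means the last step count whose marginal gain per extra step still exceeds a threshold. The function names, the threshold `min_gain_per_step`, and the example curves are all hypothetical.

```python
from typing import Mapping, Sequence


def suggest_steps(
    step_counts: Sequence[int],
    predicted_quality: Sequence[float],
    min_gain_per_step: float = 0.005,
) -> int:
    """Pick the last step count whose marginal quality gain is still worth paying for.

    Assumption: predicted_quality[i] is the predicted score (higher is better)
    when sampling with step_counts[i] denoising steps.
    """
    if not step_counts or len(step_counts) != len(predicted_quality):
        raise ValueError("step_counts and predicted_quality must be non-empty and equal length")

    chosen = step_counts[0]
    for i in range(1, len(step_counts)):
        extra_steps = step_counts[i] - step_counts[i - 1]
        if extra_steps <= 0:
            raise ValueError("step_counts must be strictly increasing")
        # Predicted quality gained per additional denoising step.
        gain_per_step = (predicted_quality[i] - predicted_quality[i - 1]) / extra_steps
        if gain_per_step >= min_gain_per_step:
            chosen = step_counts[i]
    return chosen


def suggest_steps_multilevel(
    step_counts: Sequence[int],
    curves: Mapping[str, Sequence[float]],
    min_gain_per_step: float = 0.005,
) -> int:
    """Combine several metric levels by taking the largest per-level suggestion.

    Assumption: an image is "done" only when every level has saturated.
    """
    if not curves:
        raise ValueError("at least one quality curve is required")
    return max(suggest_steps(step_counts, c, min_gain_per_step) for c in curves.values())


# Hypothetical predicted curves for one prompt (made-up numbers).
steps = [5, 10, 15, 20, 30, 50]
curves = {
    "pixel": [0.40, 0.70, 0.80, 0.82, 0.83, 0.835],
    "layout": [0.50, 0.80, 0.81, 0.81, 0.81, 0.81],
    "semantic": [0.30, 0.50, 0.70, 0.90, 0.91, 0.91],
}
print(suggest_steps(steps, curves["pixel"]))    # 15
print(suggest_steps_multilevel(steps, curves))  # 20
```

The returned value would then be used as the number of denoising steps for the sampler. Taking the maximum across levels is a conservative choice; a weighted combination of the curves would be an equally plausible reading of the multi-level design.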