PromptFix: You Prompt and We Fix the Photo

Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo
2024-10-11
Abstract: Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks. The dataset and code are available at https://www.yongshengyu.com/PromptFix-Page.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Lack of diverse instruction-following data**: Existing diffusion models perform well in image-generation tasks and can process images according to human instructions. However, because diverse instruction-following data is scarce, these models struggle to identify and execute user-defined instructions, and they perform especially poorly on low-level tasks (such as image inpainting and super-resolution).
2. **Randomness in the diffusion process leads to loss of detail**: The stochastic nature of diffusion models makes it easy to lose image details in tasks that require detailed preservation of the generated images (such as image editing or generation), especially high-frequency information (such as text and edges).

To solve these problems, the authors propose the **PromptFix** framework, which has three components:

- **Large-scale instruction-following dataset**: The dataset covers a wide range of image-processing tasks, including low-level tasks, image editing, and object creation. It contains approximately 1.01 million input-target-instruction triplets and covers a variety of low-level tasks, such as image inpainting, dehazing, colorization, super-resolution, low-light enhancement, snow removal, and watermark removal.
- **High-frequency guidance sampling**: By explicitly controlling the denoising process, this method preserves the high-frequency details in unprocessed areas. It uses a low-pass filter to calculate the fidelity constraint and fuses VAE skip-connection features during inference to maintain spatial-detail consistency.
- **Auxiliary prompting adapter**: This module uses Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. It adapts instructions and auxiliary prompts through an additional attention layer and intermittently omits instruction prompts during training, which improves the model's ability to process severely degraded images.

With these components, PromptFix can better understand user-customized instructions and performs well across a variety of image-processing tasks, with particularly strong zero-shot results in blind restoration and combination tasks.

### Formula summary

- **Fidelity constraint in high-frequency guidance sampling**:
  \[ L(I, D_\theta(z_{t\rightarrow 0}))=\|F(I)-F(D_\theta(z_{t\rightarrow 0}))\|_2^2+\|S(I)-S(D_\theta(z_{t\rightarrow 0}))\|_2^2 \]
  where \(F(\cdot)\) is the Fourier filtering operator and \(S(\cdot)\) is the Sobel edge-detection operator.
- **Forward process of the diffusion model**:
  \[ z_t = q(z_0,\epsilon,t)=\alpha_t z_0+\sigma_t\epsilon,\quad\forall t\in[0,T] \]
  where \(\alpha_t\) and \(\sigma_t\) are coefficients that control the signal-to-noise ratio.
- **Probability-flow ODE of the backward diffusion process**:
  \[ dz=\left(f(z,t)-\frac{1}{2}g(t)^2\nabla_z\log p_t(z)\right)dt \]
  where \(f\) and \(g\) are the drift and diffusion coefficients and \(\nabla_z\log p_t(z)\) is the score of the noisy data distribution at time \(t\).

Together, these formulas describe the diffusion process that PromptFix builds on and the constraint it uses to preserve detail during sampling.
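The forward process and the fidelity constraint from the formula summary can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`forward_noise`, `fourier_filter`, `sobel`, `fidelity`) are hypothetical, the Fourier operator is assumed to be a hard radial mask in the frequency domain with an arbitrary `cutoff`, the Sobel operator is assumed to use zero padding, and images are plain nested lists. The naive DFT is only practical for tiny inputs.

```python
import cmath
import math
import random


def forward_noise(z0, alpha_t, sigma_t, rng):
    """Forward process: z_t = alpha_t * z_0 + sigma_t * eps, eps ~ N(0, 1)."""
    return [[alpha_t * v + sigma_t * rng.gauss(0.0, 1.0) for v in row] for row in z0]


def fourier_filter(img, cutoff, keep_low=True):
    """Assumed form of F: naive 2-D DFT, hard radial mask, inverse DFT."""
    h, w = len(img), len(img[0])
    spec = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            spec[u][v] = sum(
                img[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
                for y in range(h) for x in range(w)
            )
            # Distance from the zero frequency, accounting for wrap-around.
            is_low = math.hypot(min(u, h - u), min(v, w - v)) <= cutoff
            if is_low != keep_low:
                spec[u][v] = 0j
    return [[
        (sum(
            spec[u][v] * cmath.exp(2j * math.pi * (u * y / h + v * x / w))
            for u in range(h) for v in range(w)
        ) / (h * w)).real
        for x in range(w)] for y in range(h)]


def sobel(img):
    """Assumed form of S: Sobel gradient magnitude with zero padding."""
    h, w = len(img), len(img[0])
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        gx += kx[dy][dx] * img[yy][xx]
                        gy += ky[dy][dx] * img[yy][xx]
            out[y][x] = math.hypot(gx, gy)
    return out


def squared_l2(a, b):
    """Squared L2 distance between two equally sized 2-D lists."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def fidelity(image, decoded, cutoff=1.0):
    """L(I, D) = ||F(I) - F(D)||^2 + ||S(I) - S(D)||^2 for a decoded estimate D."""
    freq_term = squared_l2(fourier_filter(image, cutoff), fourier_filter(decoded, cutoff))
    edge_term = squared_l2(sobel(image), sobel(decoded))
    return freq_term + edge_term


# Toy usage: a 4x4 checkerboard and a noised copy of it.
rng = random.Random(0)
image = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
noisy = forward_noise(image, alpha_t=0.8, sigma_t=0.6, rng=rng)
print(fidelity(image, image), fidelity(image, noisy))
```

In this sketch `decoded` stands in for the decoder output \(D_\theta(z_{t\rightarrow 0})\); here a noised copy of the image is passed directly, purely to show that the loss is zero for identical inputs and grows as the two images diverge. A practical version would operate on tensors and use an FFT library rather than the quadratic-time DFT above.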