Abstract: Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks. The dataset and code are available at https://www.yongshengyu.com/PromptFix-Page.

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Lack of diverse instruction-following data**: Existing diffusion models perform well in image-generation tasks and can process images according to human instructions. However, because diverse instruction-following data is scarce, these models struggle to identify and execute user-defined instructions, and they perform especially poorly in low-level tasks (such as image inpainting and super-resolution).
2. **Randomness in the diffusion process leads to loss of detail**: The stochastic nature of diffusion models makes it easy to lose image details in generation or editing tasks that require detailed preservation of the generated images, especially high-frequency information (such as text and edges).

To solve these problems, the authors propose the **PromptFix** framework, which consists of the following components:

- **Large-scale instruction-following dataset**: Covers a wide range of image-processing tasks, including low-level tasks, image editing, and object creation. The dataset contains approximately 1.01 million input-target-instruction triplets and covers a variety of low-level tasks, such as image inpainting, defogging, colorization, super-resolution, low-light enhancement, snow removal, and watermark removal.
- **High-frequency guidance sampling**: Explicitly controls the denoising process so that the high-frequency details in unprocessed areas are preserved. The method uses a low-pass filter to calculate the fidelity constraint and fuses VAE skip-connection features during inference to maintain spatial-detail consistency.
- **Auxiliary prompt module**: Uses Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. The module adapts instructions and auxiliary prompts through an additional attention layer and intermittently omits instruction prompts during training, which improves the model's ability to process severely degraded images.

With these components, PromptFix can better understand users' custom instructions and exhibits superior performance in various image-processing tasks, with particularly strong zero-shot capabilities in blind restoration and combination tasks.

### Formula summary

- **Fidelity constraint in high-frequency guidance sampling**:
  \[ L(I, D_\theta(z_{t\rightarrow 0})) = \|F(I) - F(D_\theta(z_{t\rightarrow 0}))\|_2^2 + \|S(I) - S(D_\theta(z_{t\rightarrow 0}))\|_2^2 \]
  where \(F(\cdot)\) is the Fourier filtering operator and \(S(\cdot)\) is the Sobel edge-detection operator.
- **Forward process of the diffusion model**:
  \[ z_t = q(z_0, \epsilon, t) = \alpha_t z_0 + \sigma_t \epsilon, \quad \forall t \in [0, T] \]
  where \(\alpha_t\) and \(\sigma_t\) are coefficients that manage the signal-to-noise ratio.
- **Probability-flow ODE of the backward diffusion process**:
  \[ dz = \left( f(z, t) - \frac{1}{2} g(t)^2 \nabla_z \log p_t(z) \right) dt \]

These formulas underlie the sampling procedure that PromptFix uses in image-processing tasks.
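To make the forward process and the fidelity constraint more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`forward_diffuse`, `fourier_filter`, `sobel_magnitude`, `fidelity_loss`), the radial frequency mask, and the `cutoff` value are all hypothetical, and the inputs are plain 2D grayscale arrays rather than decoded VAE outputs. The Fourier filter defaults to low-pass only to follow the wording of the summary above.

```python
import numpy as np


def forward_diffuse(z0, eps, alpha_t, sigma_t):
    """Forward process: z_t = alpha_t * z_0 + sigma_t * eps."""
    return alpha_t * z0 + sigma_t * eps


def fourier_filter(img, cutoff=0.1, keep_low=True):
    """Radial mask in the frequency domain (assumed form of F).

    `cutoff` is a hypothetical radius in cycles per pixel.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    mask = radius <= cutoff if keep_low else radius > cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(img) * mask))


def sobel_magnitude(img):
    """Sobel gradient magnitude with edge padding (assumed form of S)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 kernel response from shifted views of the image.
    for i in range(3):
        for j in range(3):
            patch = padded[i:i + h, j:j + w]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)


def fidelity_loss(target, decoded, cutoff=0.1):
    """Squared L2 distance in the Fourier-filtered and Sobel domains."""
    f_term = np.sum((fourier_filter(target, cutoff)
                     - fourier_filter(decoded, cutoff)) ** 2)
    s_term = np.sum((sobel_magnitude(target)
                     - sobel_magnitude(decoded)) ** 2)
    return float(f_term + s_term)


# Usage: noise a clean image, then compare it against the clean one.
rng = np.random.default_rng(0)
clean = rng.random((16, 16))
noisy = forward_diffuse(clean, rng.standard_normal((16, 16)), 0.9, 0.1)
loss = fidelity_loss(clean, noisy)
```

In a guidance-sampling setting, a loss of this shape would typically be evaluated on the decoded prediction at each denoising step and its gradient used to steer the latent. That step needs an autodiff framework and is left out of this sketch.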

PromptFix: You Prompt and We Fix the Photo

In-Context Learning Unlocked for Diffusion Models

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Prompt Diffusion Robustifies Any-Modality Prompt Learning

Prompt-In-Prompt Learning for Universal Image Restoration

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

Dynamic Prompt Optimizing for Text-to-Image Generation

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Optimizing Prompts for Text-to-Image Generation

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

PromptCoT: Align Prompt Distribution Via Adapted Chain-of-Thought

MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond

Textual Prompt Guided Image Restoration

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models