Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

Junsung Lee, Minsoo Kang, Bohyung Han
2024-09-12
Abstract: We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key challenges in text-driven image-to-image translation:

1. **Finding an ideal starting point for the reverse diffusion process**: It is difficult to determine an initial noise state from which the generated image both reflects the target prompt and preserves the background and structure of the source image.
2. **Editing specific regions**: It is difficult to modify only the regions related to the target prompt without distorting the rest of the image.

To address these challenges, the authors propose a simple, effective, training-free method for diffusion models. The method revises the standard noise prediction network of a pretrained diffusion model by introducing a noise correction term. The noise correction term is computed with a progressive interpolation of the source and target prompt embeddings, which enables selective editing of the regions of interest while preserving the overall structure and background of the image.

### Method Overview

The noise prediction network proposed by the authors consists of two parts:

- **Standard denoising term**: Reconstructs the overall structure and background of the source image.
- **Noise correction term**: Selectively modifies the regions related to the target prompt by progressively interpolating the source and target prompt embeddings.

The final noise prediction network can be represented as a linear combination of these two parts:

\[ \hat{\epsilon}_\theta(x_{t}^{\text{tgt}}, t, y_{\text{tgt}}) = \epsilon_\theta(x_{t}^{\text{src}}, t, y_{\text{src}}) + \gamma \Delta \epsilon_\theta(x_{t}^{\text{tgt}}, t, y_t) \]

where \(\gamma\) is the weight of the correction, \(y_t\) is the interpolated prompt embedding at timestep \(t\), and \(\Delta \epsilon_\theta(x_{t}^{\text{tgt}}, t, y_t)\) is the noise correction term, defined as:

\[ \Delta \epsilon_\theta(x_{t}^{\text{tgt}}, t, y_t) = \epsilon_\theta(x_{t}^{\text{tgt}}, t, y_t) - \epsilon_\theta(x_{t}^{\text{tgt}}, t, y_{\text{src}}) \]

### Main Contributions

1. **A new noise prediction strategy**: Progressively updating the text prompt embedding yields a smooth transition from the source prompt to the target prompt.
2. **A noise correction term**: The term ensures that the generated image both reflects the target prompt and maintains the structure and background of the source image.
3. **Experimental validation**: The method performs well on multiple tasks and consistently improves existing frameworks when combined with them.

In conclusion, the paper addresses the key challenges in text-driven image-to-image translation by introducing a noise correction term computed with progressively interpolated prompt embeddings, thereby achieving high-quality image editing.
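
As a minimal sketch of the two formulas in the Method Overview, the Python snippet below combines the standard denoising term with the noise correction term for a generic denoiser. The linear interpolation schedule, the function names (`interpolate_prompt`, `corrected_noise`, `toy_eps`), and the toy denoiser itself are illustrative assumptions, not the authors' implementation; the summary only states that the interpolation between the source and target prompt embeddings is progressive.

```python
import numpy as np


def interpolate_prompt(e_src, e_tgt, t, num_steps):
    """Interpolated prompt embedding y_t.

    Assumption: a linear schedule in which t counts down from
    num_steps - 1 (start of the reverse process, pure source prompt)
    to 0 (end of the reverse process, pure target prompt).
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    beta = 1.0 - t / (num_steps - 1)  # 0 at the first reverse step, 1 at the last
    return (1.0 - beta) * e_src + beta * e_tgt


def corrected_noise(eps_fn, x_src, x_tgt, t, e_src, e_t, gamma):
    """Standard denoising term plus gamma times the noise correction term."""
    # Standard denoising term: source latent with the source prompt.
    base = eps_fn(x_src, t, e_src)
    # Noise correction term: difference of two predictions on the target latent.
    delta = eps_fn(x_tgt, t, e_t) - eps_fn(x_tgt, t, e_src)
    return base + gamma * delta


def toy_eps(x, t, e):
    """Hypothetical stand-in for a pretrained noise prediction network."""
    return 0.1 * x + 0.01 * t + e.mean()


# Usage with toy latents and embeddings (all values are made up).
e_src, e_tgt = np.zeros(8), np.ones(8)
x_src = np.linspace(-1.0, 1.0, 4)
x_tgt = x_src + 0.5
y_t = interpolate_prompt(e_src, e_tgt, t=5, num_steps=10)
eps_hat = corrected_noise(toy_eps, x_src, x_tgt, 5, e_src, y_t, gamma=0.7)
```

Under these assumptions, the correction vanishes whenever the interpolated embedding equals the source embedding, so the prediction falls back to the standard denoising term. This matches the stated role of that term in reconstructing regions that must be preserved.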