DiffHarmony: Latent Diffusion Model Meets Image Harmonization

Pengfei Zhou, Fangxiang Feng, Xiaojie Wang
2024-04-09
Abstract: Image harmonization, which involves adjusting the foreground of a composite image to attain a unified visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently promoted the rapid development of image-to-image translation tasks. However, training diffusion models from scratch is computationally intensive. Fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image compression autoencoder, which makes them unsuitable for image generation tasks that involve pixel-level evaluation metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. Then we implement two strategies, utilizing higher-resolution images during inference and incorporating an additional refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on iHarmony4 datasets demonstrate the superiority of our proposed method. The code and model will be made publicly available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses image harmonization: adjusting the foreground of a composite image so that it is visually consistent with the background. Two obstacles motivate the work. Training a diffusion model from scratch is computationally expensive, while fine-tuning a pre-trained latent diffusion model introduces reconstruction error from its image compression autoencoder, which makes the result poorly suited to tasks evaluated with pixel-level metrics. The paper proposes DiffHarmony, which first adapts a pre-trained latent diffusion model to produce harmonized but possibly blurry initial images, and then improves their clarity with two strategies: using higher-resolution images during inference and adding a refinement stage. The authors report that experiments on the iHarmony4 datasets demonstrate the superiority of the method.
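To make the reconstruction-error point concrete, the following is a minimal sketch, not the authors' code. It assumes a hypothetical stand-in for the compression autoencoder (`lossy_round_trip`, which average-pools by a factor of 8 and upsamples again) and uses MSE and PSNR as examples of pixel-level metrics. It shows that even an image that is already the ideal target scores a nonzero error after the round trip, so a pixel-level metric penalizes the compression step itself, independently of how well the foreground was harmonized.

```python
import numpy as np


def mse(a, b):
    """Mean squared error over all pixels and channels."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images match."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))


def lossy_round_trip(img, factor=8):
    """Hypothetical stand-in for an encode/decode pass (assumption).

    Average-pools each factor x factor block, then repeats the pooled
    value back to full size. A real autoencoder is learned and far more
    accurate; this only mimics the fact that fine detail is lost.
    """
    h, w, c = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor, c)
    pooled = pooled.mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)


# Synthetic "ground truth" image with high-frequency detail (random noise).
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 255.0, size=(64, 64, 3))

# Pretend the model harmonized perfectly, so only compression error remains.
decoded = lossy_round_trip(target)
detail_mse = mse(target, decoded)
detail_psnr = psnr(target, decoded)

# A flat image has no fine detail, so the same round trip loses nothing.
flat = np.full((64, 64, 3), 128.0)
flat_mse = mse(flat, lossy_round_trip(flat))

print(f"detailed image: MSE={detail_mse:.1f}, PSNR={detail_psnr:.2f} dB")
print(f"flat image:     MSE={flat_mse:.1f}")
```

The contrast between the two cases is the point of the sketch: the error comes from lost fine detail, which is consistent with the paper's choice to add steps aimed at restoring clarity after the initial harmonization.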