DreamCom: Finetuning Text-guided Inpainting Model for Image Composition

Lingxiao Lu, Jiangtong Li, Bo Zhang, Li Niu
DOI: https://doi.org/10.48550/arXiv.2309.15508
2024-01-24
Abstract: The goal of image composition is to merge a foreground object into a background image to obtain a realistic composite image. Recently, generative composition methods have been built on large pretrained diffusion models, due to their unprecedented image generation ability. However, they are weak in preserving the foreground object details. Inspired by recent text-to-image generation customized for a certain object, we propose DreamCom by treating image composition as text-guided image inpainting customized for a certain object. Specifically, we finetune a pretrained text-guided image inpainting model based on a few reference images containing the same object, during which the text prompt contains a special token associated with this object. Then, given a new background, we can insert this object into the background with the text prompt containing the special token. In practice, the inserted object may be adversely affected by the background, so we propose masked attention mechanisms to avoid negative background interference. Experimental results on DreamEditBench and our contributed MureCom dataset show the outstanding performance of our DreamCom.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problem: how to insert a foreground object into a background image seamlessly and naturally, so that the resulting composite image looks realistic. Existing generative composition methods can produce high-quality images, but they preserve foreground object details poorly. In addition, they can usually handle only a single foreground image and cannot exploit the complementary information provided by multiple reference images.

### Solutions Proposed in the Paper

To address these problems, the authors propose **DreamCom**. Its main contributions are:

1. **Image composition based on a text-guided image inpainting model**:
   - DreamCom treats image composition as text-guided image inpainting customized for a specific object.
   - A pretrained text-guided image inpainting model is finetuned on a few reference images containing the object, so that the model learns the object and associates it with a special text token.
2. **Masked cross-attention**:
   - In the cross-attention layers, wrong correspondences between the background and the text prompt can cause foreground generation to fail.
   - Masked cross-attention restricts the correspondence between the text prompt and the image regions, so that foreground generation is not disturbed by the background.
3. **Masked self-attention**:
   - In the self-attention layers, interaction between foreground and background features can shift the foreground color toward the background color.
   - Masked self-attention is applied in the first few self-attention layers to block the interaction between foreground and background, while the last self-attention layer is left unmasked to keep the foreground compatible with the background.
4. **A new multi-reference image composition dataset, MureCom**:
   - To complement the existing DreamEditBench dataset, the authors constructed MureCom, which contains more diverse backgrounds and foreground objects.
   - MureCom provides 640 background images and 96 foreground objects from 32 categories, and each foreground object has 5 reference images.

### Experimental Results

The experiments show that DreamCom performs well on both DreamEditBench and MureCom, and that it outperforms other methods particularly in foreground detail preservation and foreground-background compatibility. Specifically:

- **DINO and CLIP-I scores**: DreamCom achieves the highest scores on both metrics, indicating that the generated foreground objects better preserve the original details.
- **SSIM and LPIPS scores**: DreamCom also performs well on these two metrics, indicating that it effectively preserves the background image in the composite.
- **User study**: DreamCom obtains the best average ranking for both compatibility and fidelity in user ratings, which further supports the quantitative results.

In summary, DreamCom addresses the weaknesses of existing methods in foreground detail preservation and foreground-background compatibility by introducing masked cross-attention and masked self-attention, and it significantly improves the quality of the composite images.
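
The two masked attention mechanisms summarized above can be illustrated with a small NumPy sketch. This is only one plausible reading of the description, not the authors' implementation: the function names, the single-head layout, and the boolean `fg_mask` (True for positions inside the foreground region) are assumptions made for illustration. Here, masked self-attention lets each position attend only to positions in its own region, and masked cross-attention lets the text prompt contribute only at foreground positions. Following the summary, such a self-attention mask would be used in the first few self-attention layers only, with the last one left unmasked.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_self_attention(q, k, v, fg_mask):
    """Single-head self-attention with foreground/background interaction blocked.

    q, k, v: arrays of shape (N, d), one row per spatial position.
    fg_mask: boolean array of shape (N,), True inside the foreground region.
    Assumption: a position may attend only to positions in the same region.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (N, N)
    same_region = fg_mask[:, None] == fg_mask[None, :]  # (N, N)
    logits = np.where(same_region, logits, -np.inf)
    return softmax(logits) @ v

def masked_cross_attention(q, k_text, v_text, fg_mask):
    """Single-head cross-attention from image positions to text tokens.

    q: (N, d) image queries; k_text, v_text: (T, d) text keys and values.
    Assumption: the text contribution is zeroed at background positions.
    """
    d = q.shape[-1]
    weights = softmax(q @ k_text.T / np.sqrt(d))        # (N, T)
    return (weights @ v_text) * fg_mask[:, None]

# Toy example with random features (all sizes are illustrative).
rng = np.random.default_rng(0)
N, T, d = 6, 4, 8
x = rng.normal(size=(N, d))      # image features
text = rng.normal(size=(T, d))   # text token embeddings
fg_mask = np.array([False, True, True, False, False, True])

self_out = masked_self_attention(x, x, x, fg_mask)
cross_out = masked_cross_attention(x, text, text, fg_mask)
```

With this mask, changing the background features leaves the self-attention output at foreground positions unchanged, which is the kind of isolation the summary describes as protecting the foreground color from the background.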