Abstract: The goal of image composition is to merge a foreground object into a background image to obtain a realistic composite image. Recently, generative composition methods have been built on large pretrained diffusion models, due to their unprecedented image generation ability. However, they are weak at preserving the details of the foreground object. Inspired by recent text-to-image generation customized for a certain object, we propose DreamCom, which treats image composition as text-guided image inpainting customized for a certain object. Specifically, we finetune a pretrained text-guided image inpainting model on a few reference images containing the same object, during which the text prompt contains a special token associated with this object. Then, given a new background, we can insert this object into the background with a text prompt containing the special token. In practice, the inserted object may be adversely affected by the background, so we propose masked attention mechanisms to avoid negative background interference. Experimental results on DreamEditBench and our contributed MureCom dataset show the outstanding performance of our DreamCom.
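
To make the inpainting framing concrete, here is a minimal sketch of how the inputs for one composition request might be assembled. It assumes the insertion region is given as a bounding box, and the function name `make_inpainting_inputs`, the placeholder token `<obj>`, and the prompt template are all hypothetical; the abstract does not specify any of them.

```python
import numpy as np


def make_inpainting_inputs(background, box, token="<obj>", class_name="dog"):
    """Build (masked background, mask, prompt) for a text-guided inpainting model.

    background: float array of shape (H, W, 3) with values in [0, 1].
    box: (top, left, bottom, right) in pixels; bottom and right are exclusive.
    token, class_name: hypothetical placeholders for the special token and
    the object's category word.
    """
    top, left, bottom, right = box
    height, width = background.shape[:2]
    if not (0 <= top < bottom <= height and 0 <= left < right <= width):
        raise ValueError("box must lie inside the background image")

    # True marks the region the model is asked to fill with the object.
    mask = np.zeros((height, width), dtype=bool)
    mask[top:bottom, left:right] = True

    # Blank out the region so the model sees only the surrounding context.
    masked_background = background.copy()
    masked_background[mask] = 0.0

    # Assumed prompt template; the special token stands for the learned object.
    prompt = f"a photo of {token} {class_name}"
    return masked_background, mask, prompt
```

The masked background and mask tell the model where to generate, while the prompt tells it what to generate; after finetuning, the special token is what ties the prompt to the specific object in the reference images.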

What problem does this paper attempt to address?

The paper addresses how to insert a foreground object into a background image seamlessly and naturally, so that image composition yields a realistic composite image. Existing generative composition methods can generate high-quality images, but they preserve foreground object details poorly. In addition, these methods can usually handle only a single foreground image and cannot fully use the complementary information provided by multiple reference images.

### Solutions Proposed in the Paper

To solve these problems, the authors propose the **DreamCom** method. Its main contributions are:

1. **Image composition based on a text-guided image inpainting model**:
   - DreamCom treats image composition as text-guided image inpainting customized for a specific object.
   - A pretrained text-guided image inpainting model is finetuned on reference images containing the object, so that the model associates the object with a special text token.

2. **Masked Cross-Attention mechanism**:
   - In the cross-attention layers, a wrong correspondence between the background and the text prompt can cause foreground generation to fail.
   - Masked Cross-Attention restricts the correspondence between the text prompt and the image regions, so that foreground generation is not disturbed by the background.

3. **Masked Self-Attention mechanism**:
   - In the self-attention layers, interaction between foreground and background features can shift the foreground color toward the background color.
   - Masked Self-Attention is applied in the first few self-attention layers to block interaction between foreground and background, while the last self-attention layer is left unchanged to keep the foreground compatible with the background.

4. **A new multi-reference image composition dataset, MureCom**:
   - To compensate for the limitations of the existing DreamEditBench dataset, the authors built MureCom, which contains more diverse backgrounds and foreground objects.
   - MureCom provides 640 background images and 96 foreground objects from 32 categories, and each foreground object has 5 reference images.

### Experimental Results

The experiments show that DreamCom performs well on both DreamEditBench and MureCom, and that it outperforms other methods especially in foreground detail preservation and foreground-background compatibility. Specifically:

- **DINO and CLIP-I scores**: DreamCom achieves the highest scores on both metrics, indicating that the generated foreground objects better preserve the original details.
- **SSIM and LPIPS scores**: DreamCom also performs well on these two metrics, indicating that it preserves the details of the background image in the composite.
- **User study**: DreamCom obtains the best average ranking in both compatibility and fidelity, further supporting its performance.

In summary, DreamCom addresses the weaknesses of existing methods in foreground detail preservation and foreground-background compatibility by introducing the Masked Cross-Attention and Masked Self-Attention mechanisms, and it improves the quality of image composition.
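
The two masked attention ideas described above can be illustrated with a small NumPy sketch. This is one possible reading of the summary, not the paper's exact formulation: it assumes image features are flattened to `N` positions, that a boolean vector `fg` marks the positions inside the foreground region, that masked self-attention lets each position attend only to positions in its own region, and that masked cross-attention keeps the text-conditioned output only at foreground positions. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_self_attention(q, k, v, fg):
    """Self-attention in which foreground and background do not interact.

    q, k, v: arrays of shape (N, d) for N image positions.
    fg: boolean array of shape (N,), True inside the foreground region.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # A query may attend to a key only if both lie in the same region.
    same_region = fg[:, None] == fg[None, :]
    scores = np.where(same_region, scores, -np.inf)
    return softmax(scores) @ v


def masked_cross_attention(q_img, k_txt, v_txt, fg):
    """Cross-attention whose text-conditioned output is kept only in the foreground.

    q_img: (N, d) image queries; k_txt, v_txt: (T, d) text keys and values.
    fg: boolean array of shape (N,), True inside the foreground region.
    """
    weights = softmax(q_img @ k_txt.T / np.sqrt(q_img.shape[-1]))
    out = weights @ v_txt
    # Background positions receive no update from the text prompt (assumption).
    return out * fg[:, None]
```

Under this reading, a network would call `masked_self_attention` in its earlier self-attention layers and fall back to ordinary, unmasked self-attention in the last one, so that foreground and background can still be harmonized before the output.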

DreamCom: Finetuning Text-guided Inpainting Model for Image Composition
