Abstract: Image composition involves seamlessly integrating given objects into a specific visual context. Current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only impedes their swift implementation but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related tokens to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
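The per-step combination of edited foreground and noisy background described in the abstract can be illustrated with a minimal sketch. It assumes a binary foreground mask and the standard DDPM forward-noising formula for the background. The function names `noise_background` and `blend_step`, and the nested-list latent layout, are hypothetical and are not taken from the paper.

```python
import math

def noise_background(clean, alpha_bar, eps):
    """Noise a clean background to the current step (standard DDPM forward
    process; assumed here, not confirmed as the paper's exact schedule)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * x + b * e for x, e in zip(x_row, e_row)]
            for x_row, e_row in zip(clean, eps)]

def blend_step(foreground, noisy_background, mask):
    """Keep the edited foreground where mask == 1 and the noisy
    background where mask == 0 (hypothetical helper)."""
    return [[m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
            for f_row, b_row, m_row in zip(foreground, noisy_background, mask)]

# Toy 1x2 "latent": left pixel is foreground, right pixel is background.
background_t = noise_background([[5.0, 5.0]], alpha_bar=1.0, eps=[[0.3, -0.2]])
blended = blend_step([[1.0, 1.0]], background_t, mask=[[1.0, 0.0]])
```

Because only the masked region is ever edited, the background outside the mask is determined entirely by the original image and the noise level, which is what keeps the scene consistent under this assumption.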

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses several key issues in training-free image composition:

1. **Maintaining Object Appearance**: Existing training-free methods struggle to preserve the appearance of complex objects. They combine attention weights derived from disparate contexts and spend excessive effort on background generation, both of which degrade the quality of the generated foreground.
2. **Natural Coherence**: Current methods have difficulty achieving natural coherence between object and scene, and they tend to introduce unwanted artifacts in the transition area.
3. **Computational Efficiency**: Existing methods consume substantial computation, much of it on unnecessary background generation, which slows down inference.

To tackle these challenges, the paper proposes **PrimeComposer**, a training-free method that composites images through carefully designed attention steering across different noise levels. Specifically, PrimeComposer addresses the issues through the following components:

- **Reformulation as Local Editing**: Image composition is recast as a subject-based local editing task that focuses only on foreground generation. At each step the edited foreground is combined with the noisy background, avoiding unnecessary background modifications.
- **Correlation Diffuser (CD)**: In its self-attention layers, the synthesized subject interacts with both the referenced object and the background. The resulting attention weights are injected into the generator's self-attention layers to preserve object appearance and establish natural coherence.
- **Region-constrained Cross-Attention (RCA)**: The influence of specific subject-related tokens is restricted to the desired regions, reducing artifacts in the transition area and improving the overall consistency of the composited image.
- **Extended Classifier-free Guidance (CFG)**: The guidance effect is extended at each sampling step to enhance the harmony of the generated image.
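One plausible way to picture Region-constrained Cross-Attention is to mask the attention logits of subject-related tokens for pixels outside the target region before the softmax. The sketch below is only an illustration under that assumption. The function name, the argument layout, and the choice of masking logits with negative infinity are hypothetical, and the paper's actual formulation may differ.

```python
import math

def region_constrained_softmax(logits, subject_tokens, inside_region):
    """Softmax over text tokens for each pixel, with subject tokens
    suppressed outside the target region (hypothetical sketch).

    logits:         list over pixels of per-token attention logits
    subject_tokens: set of token indices tied to the subject
    inside_region:  list of bools, one per pixel
    Assumes every pixel keeps at least one unmasked token.
    """
    weights = []
    for row, inside in zip(logits, inside_region):
        masked = [-math.inf if (j in subject_tokens and not inside) else v
                  for j, v in enumerate(row)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two pixels, two tokens; token 1 is the subject token.
attn = region_constrained_softmax(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    subject_tokens={1},
    inside_region=[True, False],
)
```

In this toy case the pixel inside the region attends to both tokens equally, while the pixel outside gives the subject token zero weight, so the subject token cannot leak into the surrounding scene.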
With these components, PrimeComposer performs well across multiple domains (such as oil painting, sketching, cartoon animation, and hyper-realistic photography), surpasses existing methods in both qualitative and quantitative evaluations, and reports the fastest inference.

PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering
