Abstract: Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
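The layered representation described in the abstract (a background layer, foreground layers, and one mask per foreground element) can be flattened into a single image by standard alpha compositing. The following is a minimal sketch of that idea, not code from the paper: the function name `composite_layers`, the back-to-front ordering, and the treatment of masks as soft alpha values in [0, 1] are all assumptions made for illustration.

```python
import numpy as np

def composite_layers(background, foregrounds, masks):
    """Alpha-composite foreground layers over a background, back to front.

    Hypothetical helper (not from the paper).
    background:  (H, W, 3) float array with values in [0, 1]
    foregrounds: list of (H, W, 3) float arrays
    masks:       list of (H, W) float arrays in [0, 1], one per foreground
    """
    image = background.astype(np.float64).copy()
    for fg, mask in zip(foregrounds, masks):
        alpha = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
        # Assumed compositing rule: the foreground replaces the image where mask is 1
        image = alpha * fg + (1.0 - alpha) * image
    return image

# Toy 2x2 example: a red layer whose mask covers only the top row
background = np.zeros((2, 2, 3))
red = np.zeros((2, 2, 3))
red[..., 0] = 1.0
mask = np.array([[1.0, 1.0],
                 [0.0, 0.0]])
result = composite_layers(background, [red], [mask])
```

Because each foreground carries its own mask, editing or restyling one layer only requires re-running the composite with that layer swapped out, which is the kind of layer-specific editing the abstract mentions.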

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses a limitation of existing diffusion-based image generation methods: although they produce high-quality images, they generate a single holistic image and offer no object-level manipulation. Practical applications such as professional graphic design and digital art creation often create and manipulate images in multiple layers for greater flexibility and control. The paper therefore proposes a layer-collaborative diffusion model named **LayerDiff**, designed for text-guided, multi-layer, composable image synthesis.

LayerDiff captures patterns across layers with an inter-layer attention module and directs the content of each layer with a text-guided intra-layer attention module that uses layer-specific prompts. A self-mask guidance sampling strategy further improves the quality of multi-layer image generation.

The model generates images consisting of a background layer, multiple foreground layers, and a mask layer for each foreground element, so users can control the content of each layer individually. The paper also introduces a data construction pipeline that produces high-quality multi-layer composable image datasets for training LayerDiff.

### Main Contributions

1. **Layer-Collaborative Diffusion Model**: Proposes a new layer-collaborative diffusion model that achieves inter-layer information exchange through layer-collaborative attention blocks and improves content generation accuracy through a layer-specific prompt-enhanced module.
2. **Self-Mask Guidance Sampling**: Proposes a self-mask guidance sampling strategy that uses the predicted layer masks to further refine the generation results during sampling.
3. **Data Construction Pipeline**: Designs a data construction pipeline to generate high-quality multi-layer composable image datasets for training the LayerDiff model.
4. **High-Performance Generation**: Experimental results show that LayerDiff can generate high-fidelity multi-layer images with performance comparable to conventional whole-image generation methods, and that it supports controllable generation tasks such as layer-specific image editing and style transfer.
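To make the idea of inter-layer information exchange more concrete, here is a minimal sketch of one plausible reading: each spatial token attends to the tokens at the same position in every layer. This is an illustration only, not the paper's implementation. The function name `inter_layer_attention`, the `(layers, tokens, channels)` tensor layout, and the omission of learned query/key/value projections are all assumptions.

```python
import numpy as np

def inter_layer_attention(x):
    """Scaled dot-product attention across the layer axis (hypothetical sketch).

    x: (L, N, D) array of features for L layers, N spatial tokens, D channels.
    Returns an array of the same shape in which every token is a weighted
    mix of the tokens at the same spatial position in all L layers.
    Learned projections are omitted for brevity (an assumption).
    """
    num_layers, num_tokens, dim = x.shape
    tokens = x.transpose(1, 0, 2)                                # (N, L, D)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(dim)   # (N, L, L)
    scores = scores - scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over layers
    mixed = weights @ tokens                                     # (N, L, D)
    return mixed.transpose(1, 0, 2)                              # back to (L, N, D)

# Toy input: 3 layers, 4 spatial tokens, 8 channels
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4, 8))
mixed_features = inter_layer_attention(features)
```

In this sketch the attention matrix is only L x L per spatial position, so its cost grows with the number of layers rather than with image resolution. Whether LayerDiff restricts attention in this way is not stated in the summary above.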

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Text2Layer: Layered Image Generation using Latent Diffusion Model

LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge

Generative Image Layer Decomposition with Visual Effects

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

Collage Diffusion

TextDiffuser: Diffusion Models as Text Painters

Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

GlyphDiffusion: Text Generation as Image Generation

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Progressive Text-to-Image Diffusion with Soft Latent Direction

Text-driven Visual Synthesis with Latent Diffusion Prior

DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Transparent Image Layer Diffusion using Latent Transparency