Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.

What problem does this paper attempt to address?

The paper addresses the inefficiency of existing image-editing methods in interactive applications. Current inpainting methods usually regenerate the entire image even when the user modifies only a small part of it. This is slow and wastes computation, especially for high-resolution images or frequent local edits.

### Core Problems of the Paper

1. **Low Efficiency**: Existing diffusion-based inpainting methods process the entire image even when only a small region needs to be updated, which incurs unnecessary computational overhead.
2. **Loss of Global Context**: To improve efficiency, some methods process only a tight rectangular crop around the edited region. This ignores the global image context, so the generated content can be inconsistent with the rest of the image.
3. **Poor Interactivity**: As a result, these methods are poorly suited to real-time interactive editing, and the user experience is not smooth.

### Solutions of LazyDiffusion

To solve these problems, the paper proposes the LazyDiffusion model. Its main innovations include:

- **Phased Processing**:
  - **Encoder**: Processes the current canvas and the binary mask to produce a compact global-context representation. The encoder runs once for the current mask, and its output is tailored to the region to be generated.
  - **Decoder**: Conditioned on the global context and the user's text prompt, generates only the pixels covered by the mask. This significantly reduces computation because most edits involve only a small part of the image.
- **Efficient Generation**:
  - LazyDiffusion avoids reprocessing the entire image by restricting generation to the masked region.
  - The decoder's runtime scales with the size of the mask rather than the size of the whole image.
- **Maintaining Global Consistency**:
  - Although only the masked region is generated, LazyDiffusion still uses global context to keep the generated content consistent with the rest of the image. This is possible because of the compact context representation produced by the encoder.

### Experimental Results

The paper evaluates LazyDiffusion experimentally:

- **Speed Improvement**: For typical interactive edits, where the mask covers about 10% of the image, LazyDiffusion provides roughly a 10x speedup.
- **Quality Assurance**: Despite the efficiency gain, the generated images are competitive with state-of-the-art inpainting methods in quality and fidelity.

In summary, the paper targets the inefficiency of existing image-editing methods in interactive applications by proposing a new diffusion-transformer architecture that generates only the masked region while preserving the quality and global consistency of the result.
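
The claim that decoder cost follows the mask rather than the canvas can be illustrated with a small counting sketch. It is not the paper's implementation. It assumes the image is split into square patches that each become one transformer token (the summary does not state how LazyDiffusion tokenizes the image), and all function names here are hypothetical. It lists the patches that touch a binary mask and reports what share of the full token grid they represent.

```python
def make_rect_mask(height, width, top, left, box_h, box_w):
    """Build a binary mask (list of rows) with a filled rectangle of ones."""
    return [
        [1 if top <= r < top + box_h and left <= c < left + box_w else 0
         for c in range(width)]
        for r in range(height)
    ]


def masked_patch_indices(mask, patch):
    """Return (patch_row, patch_col) for every patch that touches the mask."""
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    selected = []
    for pr in range(height // patch):
        for pc in range(width // patch):
            rows = range(pr * patch, (pr + 1) * patch)
            cols = range(pc * patch, (pc + 1) * patch)
            if any(mask[r][c] for r in rows for c in cols):
                selected.append((pr, pc))
    return selected


def token_fraction(mask, patch):
    """Fraction of all patch tokens that a mask-only decoder would generate."""
    total = (len(mask) // patch) * (len(mask[0]) // patch)
    return len(masked_patch_indices(mask, patch)) / total


# Toy example (assumed sizes): 8x8 canvas, 2x2 patches, 2x4 edit rectangle.
mask = make_rect_mask(8, 8, top=2, left=2, box_h=2, box_w=4)
tokens = masked_patch_indices(mask, patch=2)
fraction = token_fraction(mask, patch=2)
print(tokens)        # patches the decoder would generate
print(fraction)      # share of the full token grid
print(1 / fraction)  # rough speedup IF cost were linear in token count
```

In this toy case 2 of the 16 patches are selected. The final line treats cost as linear in the number of tokens, which is a simplifying assumption made only for illustration. Real transformer cost also depends on attention over the context and on the one-time encoder pass.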

Lazy Diffusion Transformer for Interactive Image Editing
