Lazy Diffusion Transformer for Interactive Image Editing

Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi
2024-04-19
Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
What problem does this paper attempt to address?
This paper addresses the inefficiency of existing image-editing methods in interactive applications. Current inpainting methods usually regenerate the entire image even when the user only modifies a small part of it. This is slow and wastes computation, especially with high-resolution images or frequent local edits.

### Core Problems of the Paper

1. **Low Efficiency**: Existing diffusion-based inpainting methods process the entire image even if only a small area needs to be updated, which incurs unnecessary computational overhead.
2. **Loss of Global Context**: To improve efficiency, some methods process only a tight rectangular crop around the edited area. This ignores the global image context, so the generated content can be inconsistent with the rest of the image.
3. **Poor Interactivity**: For these reasons, such methods perform poorly in real-time interactive editing, and the user experience is not smooth.

### Solutions of LazyDiffusion

To solve these problems, the paper proposes the LazyDiffusion model. Its main innovations are:

- **Phased Processing**:
  - **Encoder**: Processes the current canvas and the binary mask to produce a compact global-context representation tailored to the region to generate. The encoder runs only once, and its output is specific to the current mask.
  - **Decoder**: Conditioned on the global context and the user's text prompt, generates only the pixels covered by the mask. This significantly reduces computation because most edits involve only a small part of the image.
- **Efficient Generation**:
  - LazyDiffusion avoids repeatedly processing the entire image by restricting generation to the mask area.
  - The decoder's runtime scales with the size of the mask rather than the size of the image.
- **Maintaining Global Consistency**:
  - Although only the mask area is generated, LazyDiffusion still uses global context to keep the generated content consistent with the overall image, thanks to the compressed context representation produced by the encoder.

### Experimental Results

The paper verifies the effectiveness of LazyDiffusion through experiments:

- **Speed Improvement**: For typical interactive edits, where the mask covers about 10% of the image, LazyDiffusion is about 10 times faster than existing methods.
- **Quality Assurance**: Despite the efficiency gain, LazyDiffusion is competitive with state-of-the-art inpainting methods in quality and fidelity, including consistency with the rest of the image.

In summary, the paper proposes a new diffusion-model architecture that removes the wasted computation of existing image-editing methods in interactive applications while preserving the quality and global consistency of the generated content.
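
The idea that the decoder's work depends on the mask rather than on the whole canvas can be illustrated with a small token-counting sketch. This is not the authors' implementation: the patch size, the rule "a patch is active if it contains any masked pixel", and the helper names `masked_patch_indices` and `token_reduction` are all assumptions made for illustration. The ratio of token counts is also only a rough proxy for runtime, since it ignores the encoder and the actual cost of attention.

```python
# Illustrative sketch only: counts which image patches a "lazy" decoder
# would have to generate for a given binary mask. All names and the patch
# selection rule are hypothetical, not taken from the paper.

def masked_patch_indices(mask, patch):
    """Return sorted flat indices of patches containing at least one masked pixel.

    mask  : list of rows, each a list of 0/1 values (1 = pixel to generate)
    patch : side length of a square patch, must divide both image dimensions
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("patch size must divide the image dimensions")
    patches_per_row = width // patch
    active = set()
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                active.add((y // patch) * patches_per_row + (x // patch))
    return sorted(active)


def token_reduction(mask, patch):
    """Ratio of all patch tokens to the tokens a mask-only decoder would process."""
    total = (len(mask) // patch) * (len(mask[0]) // patch)
    active = len(masked_patch_indices(mask, patch))
    if active == 0:
        raise ValueError("mask is empty, nothing to generate")
    return total / active


if __name__ == "__main__":
    # 8x8 canvas, 2x2 patches (16 tokens); mask a 2x4 block of pixels.
    mask = [[0] * 8 for _ in range(8)]
    for y in (2, 3):
        for x in range(4, 8):
            mask[y][x] = 1
    print(masked_patch_indices(mask, 2))  # [6, 7]
    print(token_reduction(mask, 2))       # 8.0
```

In this toy case the mask touches 2 of 16 patches, so a decoder restricted to the masked region would handle one eighth of the tokens that a full-canvas generator would.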