Abstract: Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning-dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
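To make the idea of a mask "grown around a point" concrete, here is a minimal sketch in Python. It is not the Click2Mask procedure: the abstract states that the growth is guided by a masked CLIP-based semantic loss during the BLD process, whereas this example uses plain, unguided 4-neighbour dilation from the clicked pixel. The function name `grow_mask` and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def grow_mask(shape, click, steps):
    """Grow a boolean mask from a single clicked pixel.

    Assumption: unguided 4-neighbour dilation, used only to illustrate
    growing a region from one point. The paper guides the growth with a
    semantic loss, which is not modelled here.
    """
    mask = np.zeros(shape, dtype=bool)
    mask[click] = True  # seed the mask at the clicked (row, col)
    for _ in range(steps):
        grown = mask.copy()
        grown[1:, :] |= mask[:-1, :]   # spread one pixel down
        grown[:-1, :] |= mask[1:, :]   # spread one pixel up
        grown[:, 1:] |= mask[:, :-1]   # spread one pixel right
        grown[:, :-1] |= mask[:, 1:]   # spread one pixel left
        mask = grown
    return mask

# Hypothetical usage: a click at the centre of a 64x64 grid, grown 5 steps.
mask = grow_mask((64, 64), click=(32, 32), steps=5)
```

After `steps` rounds the mask is a diamond of pixels within Manhattan distance `steps` of the click. In a guided variant, the rule deciding which neighbouring pixels join the mask would depend on the loss rather than on distance alone.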

What problem does this paper attempt to address?

This paper aims to solve several key problems in local image editing:

1. **Simplify user input**:
   - Existing methods usually require users to provide a precise mask or a detailed description of the area to edit, which is both cumbersome and error-prone.
   - **Click2Mask** requires only a reference point (for example, a mouse click) together with a content description, which greatly simplifies the input process.
2. **Improve editing precision and flexibility**:
   - When adding new content, existing methods are often limited by the boundaries of existing objects or segmented areas, so new objects cannot be placed freely.
   - **Click2Mask** generates the mask dynamically, so it can add new objects without relying on an existing segmentation while keeping the edited area precise and relevant to its context.
3. **Improve the quality of editing results**:
   - In complex scenes, existing methods may produce unexpected results, such as modifying the whole image, failing to make the requested edit, or altering other objects.
   - **Click2Mask** uses Blended Latent Diffusion (BLD) and an Alpha-CLIP-based semantic loss so that the edit matches the user's intention and looks realistic.
4. **Reduce user burden**:
   - Users no longer need to provide complex masks or detailed editing instructions. A click and a brief text description are enough to complete a high-quality local edit.
In summary, **Click2Mask** addresses three weaknesses of existing local image editing methods: complex user input, limited editing flexibility, and hard-to-control editing results. It offers a simpler, more flexible, and higher-quality alternative.
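The summary mentions Blended Latent Diffusion without describing how a mask confines an edit. As a rough illustration of mask-based blending in latent space, the sketch below keeps generated content inside the mask and source content outside it. It assumes a soft or binary mask at the latent resolution and random arrays standing in for real latents. The names `blend_latents`, `z_fg`, and `z_bg` are hypothetical, and the sketch omits the diffusion model, the noise schedule, and the dynamic mask update.

```python
import numpy as np

def blend_latents(z_fg, z_bg, mask):
    """Blend two latents with a mask.

    Where mask == 1 the generated (foreground) latent is kept; where
    mask == 0 the source (background) latent is kept.
    """
    return mask * z_fg + (1.0 - mask) * z_bg

rng = np.random.default_rng(0)
z_fg = rng.normal(size=(4, 8, 8))  # stand-in for the latent of the edited content
z_bg = rng.normal(size=(4, 8, 8))  # stand-in for the noised latent of the source image

# Mask with a single channel that broadcasts over the 4 latent channels.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

z = blend_latents(z_fg, z_bg, mask)
```

In a blended-diffusion setting this kind of blend is applied repeatedly during denoising, so the region outside the mask stays tied to the source image while the region inside follows the text prompt.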

Click2Mask: Local Editing with Dynamic Mask Generation
