Click2Mask: Local Editing with Dynamic Mask Generation

Omer Regev, Omri Avrahami, Dani Lischinski
2024-09-13
Abstract: Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key problems in local image editing:

1. **Simplify user input**:
   - Existing methods usually require a precise mask or a detailed description of the target area, both of which are cumbersome and error-prone. For example, users must draw an exact mask over the area to be edited, or describe the edit location in detail in natural language.
   - **Click2Mask** requires only a single reference point (for example, a mouse click) together with a content description. This greatly simplifies the user input process.
2. **Improve editing precision and flexibility**:
   - When adding new content, existing methods are often constrained by the boundaries of existing objects or segmented areas, so they cannot freely add new, unconstrained objects.
   - **Click2Mask** generates the mask dynamically, so it can add new objects without relying on an existing segmentation while keeping the edited area precise and contextually relevant.
3. **Improve the quality of editing results**:
   - In complex scenes, existing methods may produce unexpected results, such as modifying the whole image, failing to make the requested edit, or incorrectly altering other objects.
   - **Click2Mask** combines Blended Latent Diffusion (BLD) with an Alpha-CLIP-based semantic loss so that the result matches the user's intent and looks realistic.
4. **Reduce user burden**:
   - Users no longer need to provide complex masks or detailed editing instructions. A click and a brief text description are enough to complete a high-quality local edit.
In summary, **Click2Mask** mainly solves the problems of complex user input, poor editing flexibility, and uncontrollable editing results in existing local image editing methods, and provides a more concise, flexible, and high-quality solution.
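
To make the idea of a click-seeded mask and masked blending more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`seed_mask`, `grow_mask`, `blend_latents`) are hypothetical, and the growth rule is a fixed one-pixel dilation used purely as a stand-in for the loss-guided mask evolution described above. The blending line only illustrates the general blended-diffusion idea of keeping edited content inside the mask and original content outside it.

```python
import numpy as np


def seed_mask(height, width, click_yx, radius):
    """Binary disk of the given radius centred on the clicked pixel (hypothetical helper)."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = click_yx
    return ((ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2).astype(np.float32)


def grow_mask(mask, steps=1):
    """Grow a binary mask by one pixel per step using 4-neighbour dilation.

    Assumption: a fixed dilation replaces the semantic-loss-guided growth
    of the actual method, which this sketch does not model.
    """
    grown = mask.astype(bool)
    for _ in range(steps):
        padded = np.pad(grown, 1, mode="constant", constant_values=False)
        grown = (
            padded[1:-1, 1:-1]   # the pixel itself
            | padded[:-2, 1:-1]  # neighbour above
            | padded[2:, 1:-1]   # neighbour below
            | padded[1:-1, :-2]  # neighbour to the left
            | padded[1:-1, 2:]   # neighbour to the right
        )
    return grown.astype(np.float32)


def blend_latents(edited, background, mask):
    """Keep edited content inside the mask and the original content outside it."""
    return mask * edited + (1.0 - mask) * background


# Usage: a single clicked pixel grown once becomes a 5-pixel plus shape.
mask = grow_mask(seed_mask(5, 5, click_yx=(2, 2), radius=0), steps=1)
edited = np.ones((5, 5), dtype=np.float32)
background = np.zeros((5, 5), dtype=np.float32)
blended = blend_latents(edited, background, mask)
print(int(blended.sum()))  # 5
```

In a real diffusion pipeline the blend would be applied to latent tensors at each denoising step, and the mask would change between steps. Here both are reduced to small 2D arrays so the behaviour is easy to inspect.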