Abstract: Multimodal clothing image editing refers to the precise adjustment and modification of clothing images using data such as textual descriptions and visual images as control conditions, which improves designers' work efficiency and lowers the barrier to design for users. In this paper, we propose a new image editing method, ControlEdit, which recasts clothing image editing as multimodal-guided local inpainting of clothing images. We address the difficulty of collecting real image datasets by leveraging a self-supervised learning approach. Based on this learning approach, we extend the channels of the feature extraction network to keep the clothing image style consistent before and after editing, and we design an inverse latent loss function to achieve soft control over the content of non-edited areas. In addition, we adopt Blended Latent Diffusion as the sampling method so that editing boundaries transition naturally and the content of non-edited areas stays consistent. Extensive experiments demonstrate that ControlEdit surpasses baseline algorithms in both qualitative and quantitative evaluations.

What problem does this paper attempt to address?

The paper addresses multimodal local editing of clothing images, that is, precisely adjusting and modifying a clothing image using text descriptions and visual images as control conditions. Specifically, it aims to overcome the following challenges:

1. **Difficulty in dataset collection**: It is hard to obtain enough real clothing images paired with their edited counterparts for training.
2. **Maintenance and transition of non-edited areas**: During editing, the content of non-edited areas must stay intact, and the boundary between edited and non-edited areas must transition naturally.
3. **Complexity of the generation task**: Unlike traditional image-translation tasks, clothing image editing requires not only domain conversion from sketches to realistic images, but also a plausible fusion of the generated content with the source image.

To solve these problems, the paper proposes a new image-editing method, ControlEdit. The method reduces the dependence on real paired image datasets by using self-supervised learning, and it keeps the image style consistent before and after editing by expanding the number of channels in the feature-extraction network. In addition, the paper designs an inverse latent loss function to achieve soft control over the content of non-edited areas, and it adopts Blended Latent Diffusion as the sampling method so that editing boundaries transition naturally and the content of non-edited areas stays consistent.

The main contributions of the paper include:

1. Proposing ControlEdit, a multimodal local-editing method based on ControlNet that uses sketches, natural language, and masked source images to guide image generation.
2. Designing an inverse latent loss function that refines the original ControlNet loss and promotes consistency of the content in non-edited areas.
3. Performing a masked fusion of the generated features and the source-image features in the latent space at each inference step, which avoids the unnatural mask transitions and inconsistent styles that arise when blending in pixel space.
4. Showing experimentally that ControlEdit achieves better image-generation quality than the baseline models on benchmark tests.

Together, these contributions address the key challenges in multimodal clothing image editing and improve the controllability, realism, and plausibility of the editing results.
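The per-step masked fusion in latent space can be illustrated with a minimal sketch. This is not the authors' implementation: it follows the general idea of Blended Latent Diffusion, and the names `blend_latents`, `noise_to_level`, `blended_sampling`, the `denoise_step` callback, and the `alpha_bars` schedule are all hypothetical placeholders. The sketch assumes a mask that is 1 inside the editable region and 0 elsewhere, already resized to latent resolution.

```python
import numpy as np

def blend_latents(z_generated, z_source_noised, mask):
    """Keep generated content inside the mask and source content outside it."""
    return mask * z_generated + (1.0 - mask) * z_source_noised

def noise_to_level(z_source, alpha_bar, rng):
    """Forward-diffuse the clean source latent to a given noise level."""
    eps = rng.standard_normal(z_source.shape)
    return np.sqrt(alpha_bar) * z_source + np.sqrt(1.0 - alpha_bar) * eps

def blended_sampling(z_source, mask, denoise_step, alpha_bars, rng):
    """Toy reverse process that blends with the source latent at every step.

    denoise_step(z, t) is a hypothetical stand-in for one conditioned denoiser
    update that moves z from noise level t to level t - 1. alpha_bars[t] is the
    assumed cumulative signal fraction at step t (decreasing in t).
    """
    z = rng.standard_normal(z_source.shape)
    for t in reversed(range(len(alpha_bars))):
        z = denoise_step(z, t)
        if t > 0:
            # Match the source latent to the noise level the sample now has.
            z_src = noise_to_level(z_source, alpha_bars[t - 1], rng)
        else:
            # Final step: use the clean source latent outside the mask.
            z_src = z_source
        z = blend_latents(z, z_src, mask)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_source = rng.standard_normal((4, 8, 8))   # toy latent: channels x H x W
    mask = np.zeros((1, 8, 8))
    mask[:, 2:6, 2:6] = 1.0                     # editable region
    alpha_bars = np.linspace(0.99, 0.01, 10)
    out = blended_sampling(z_source, mask, lambda z, t: 0.5 * z, alpha_bars, rng)
    print(out.shape)
```

Because the last blend uses the clean source latent, the non-edited region of the result matches the source latent exactly in this sketch, while the region under the mask comes entirely from the denoiser. A real pipeline would replace the toy `denoise_step` with a scheduler update driven by the conditioned network and decode the final latent with the autoencoder.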

ControlEdit: A MultiModal Local Clothing Image Editing Method

DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

InsightEdit: Towards Better Instruction Following for Image Editing

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Image-based Clothes Changing System

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial Editing

Edit Like A Designer

LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

StyleBooth: Image Style Editing with Multimodal Instruction

Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea