ControlEdit: A MultiModal Local Clothing Image Editing Method

Di Cheng, YingJie Shi, ShiXin Sun, JiaFu Zhang, WeiJing Wang, Yu Liu
2024-09-23
Abstract: Multimodal clothing image editing refers to the precise adjustment and modification of clothing images using data such as textual descriptions and visual images as control conditions, which improves the work efficiency of designers and lowers the barrier to design for users. In this paper, we propose a new image editing method, ControlEdit, which recasts clothing image editing as multimodal-guided local inpainting of clothing images. We address the difficulty of collecting real image datasets by leveraging a self-supervised learning approach. Based on this learning approach, we extend the channels of the feature extraction network to ensure consistent clothing image style before and after editing, and we design an inverse latent loss function to achieve soft control over the content of non-edited areas. In addition, we adopt Blended Latent Diffusion as the sampling method to make the editing boundaries transition naturally and enforce consistency of non-edited area content. Extensive experiments demonstrate that ControlEdit surpasses baseline algorithms in both qualitative and quantitative evaluations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses multimodal local editing of clothing images: precisely adjusting and modifying a clothing image using text descriptions and visual images as control conditions. Specifically, it aims to overcome the following challenges:

1. **Difficulty in dataset collection**: It is difficult to obtain enough real clothing images paired with their edited counterparts for training.
2. **Maintenance and transition of non-edited areas**: The content of non-edited areas must stay intact, and the boundary between edited and non-edited areas must transition naturally.
3. **Complexity of the generation task**: Unlike traditional image translation tasks, clothing image editing requires not only domain conversion from sketches to real images, but also a plausible fusion of the generated content with the source image.

To solve these problems, the paper proposes ControlEdit, a new image editing method. It reduces the dependence on real paired image datasets through self-supervised learning, and it keeps the image style consistent before and after editing by extending the channels of the feature extraction network. In addition, the paper designs an inverse latent loss function to achieve soft control over the content of non-edited areas, and it adopts Blended Latent Diffusion as the sampling method so that editing boundaries transition naturally and non-edited content stays consistent.

The main contributions of the paper include:

1. Proposing ControlEdit, a multimodal local editing method based on ControlNet that uses sketches, natural language, and masked source images to guide image generation.
2. Designing an inverse latent loss function that modifies the original ControlNet loss function and promotes consistency of content in non-edited areas.
3. Performing a masked fusion of the generated features and the source image features in latent space at each inference step, which avoids the unnatural mask transitions and inconsistent styles that arise when blending in pixel space.
4. Experimental results showing that ControlEdit achieves better image generation quality than the baseline models in benchmark tests.

Together, these contributions address the key challenges in multimodal clothing image editing and improve the controllability, realism, and plausibility of the editing results.
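
The per-step masked fusion in latent space described above can be illustrated with a minimal NumPy sketch of blended latent sampling in general. This is not the paper's implementation: the function names, the `denoise_step` callback, and the `alpha_bars` noise schedule are assumptions made for illustration, and a real system would plug in a ControlNet-conditioned denoiser and a VAE encoder/decoder.

```python
import numpy as np

def add_noise(z0, alpha_bar, rng):
    """Forward-diffuse a clean latent z0 to the noise level given by alpha_bar."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def blend_latents(z_generated, z_source, mask):
    """Keep generated content where mask == 1 and source content where mask == 0."""
    return mask * z_generated + (1.0 - mask) * z_source

def blended_sampling(z_source, mask, denoise_step, alpha_bars, rng):
    """Hypothetical sampling loop: blend with the source latent after every step.

    z_source     : clean latent of the source image, shape (C, H, W)
    mask         : edit mask in latent resolution, shape (1, H, W), 1 = editable
    denoise_step : assumed callback (z_t, t) -> z_{t-1} for the generated branch
    alpha_bars   : assumed cumulative noise schedule, index 0 = least noisy
    """
    z = rng.standard_normal(z_source.shape)  # start from pure noise
    for t in reversed(range(len(alpha_bars))):
        z_fg = denoise_step(z, t)
        # Bring the source latent to the same noise level as z_fg;
        # at the last step the clean source latent is used directly.
        z_bg = add_noise(z_source, alpha_bars[t - 1], rng) if t > 0 else z_source
        z = blend_latents(z_fg, z_bg, mask)
    return z
```

Because the blend happens at every step rather than once at the end, the denoiser sees the source content around the mask boundary throughout sampling, which is the intuition behind the smoother transitions the summary mentions. With this hard mask, the non-edited region of the final latent equals the source latent exactly.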