Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara
2023-08-23
Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
The problem this paper addresses is how to generate high-quality, practically useful fashion images from multimodal inputs (text, human body poses, and garment sketches) while preserving the identity and body shape of the original model. Specifically, the authors propose a new task, **multimodal-conditioned fashion image editing**, and develop a new architecture named **Multimodal Garment Designer (MGD)** for it.

### Problem Background

In fashion design, designers use illustrations to convey their concepts and carry them from the abstract idea to the finished garment. Computer vision can support this process. However, most existing methods focus on virtual try-on and pay little attention to generating new fashion images from multimodal inputs such as text descriptions, human body poses, and garment sketches.

### Research Objectives

The authors aim to:

1. **Generate high-quality fashion images**: Ensure that the generated images are realistic and also consistent with the given multimodal inputs (text descriptions, human body poses, and garment sketches).
2. **Retain the identity and body features of the original model**: When the clothing is replaced, the identity and body shape of the model must not change.
3. **Provide more flexible design tools**: By introducing multimodal inputs, let designers control the details of the generated garment more precisely.

### Solutions

To meet these objectives, the authors propose the following:

- **New task definition**: Define the task of multimodal-conditioned fashion image editing, in which text, human body poses, and garment sketches jointly guide image generation.
- **New architecture design**: Introduce MGD, a multimodal generation architecture based on Latent Diffusion Models (LDMs) that takes text descriptions, human body poses, and garment sketches directly as conditioning for image generation.
- **Dataset expansion**: To support the new task, extend two existing fashion datasets (Dress Code and VITON-HD) with multimodal annotations such as text descriptions and garment sketches.

### Experimental Results

The experimental results show that the proposed MGD architecture outperforms existing competitors on multiple evaluation metrics (FID, KID, CLIP-S, PD, and SD), both in the realism of the generated images and in their consistency with the multimodal inputs.

In conclusion, the paper addresses the shortcomings of existing methods in generating high-quality fashion images that meet practical requirements by introducing the task of multimodal-conditioned fashion image editing and a corresponding generation architecture.
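
Of the metrics listed under Experimental Results, CLIP-S is the one that scores agreement between a generated image and its text prompt. The summary does not define it, so the following is only a minimal sketch under the assumption that CLIP-S is a CLIP-style score: a cosine similarity between an image embedding and a text embedding, clipped at zero and optionally rescaled. The function names, the `weight` parameter, and the toy vectors are hypothetical; in practice the embeddings would come from a CLIP image encoder and text encoder, and scaling conventions differ between implementations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def clip_style_score(image_emb, text_emb, weight=1.0):
    """Assumed form of the score: weight * max(cos(image, text), 0).

    `weight` is a hypothetical rescaling factor; it is not taken from the paper.
    """
    return weight * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy embeddings (hypothetical values, not real CLIP outputs).
image_emb = [0.2, 0.9, 0.1]
matching_text = [0.25, 0.85, 0.05]
unrelated_text = [0.9, -0.3, 0.4]

print(clip_style_score(image_emb, matching_text))   # close to 1: well aligned
print(clip_style_score(image_emb, unrelated_text))  # clipped to 0: negative cosine
```

A higher score means the image embedding points in nearly the same direction as the text embedding, which is the sense in which such a metric rewards coherence with the textual input.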