IMAGDressing-v1: Customizable Virtual Dressing

Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, Jinhui Tang
2024-08-06
Abstract: Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional faces, poses, and scenes. To address this issue, we define a virtual dressing (VD) task focused on generating freely editable human images with fixed garments and optional conditions. Meanwhile, we design a comprehensive affinity metric index (CAMI) to evaluate the consistency between generated images and reference garments. Then, we propose IMAGDressing-v1, which incorporates a garment UNet that captures semantic features from CLIP and texture features from VAE. We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet, ensuring users can control different scenes through text. IMAGDressing-v1 can be combined with other extension plugins, such as ControlNet and IP-Adapter, to enhance the diversity and controllability of generated images. Furthermore, to address the lack of data, we release the interactive garment pairing (IGPair) dataset, containing over 300,000 pairs of clothing and dressed images, and establish a standard pipeline for data assembly. Extensive experiments demonstrate that our IMAGDressing-v1 achieves state-of-the-art human image synthesis performance under various controlled conditions. The code and model will be available at https://github.com/muzishen/IMAGDressing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited flexibility and editability of existing virtual try-on (VTON) technology when presenting garments, which leaves merchants unable to showcase clothing comprehensively. Existing VTON methods focus on localized garment inpainting, given a garment and a fixed human image. Although this improves consumers' online shopping experience, it ignores the details that merchants need to control when displaying garments, such as faces, poses, and scenes. To close this gap, the paper defines a new task, virtual dressing (VD), which aims to generate freely editable human images with fixed garments and optional conditions, enabling a more comprehensive and personalized garment display.

### Main Contributions

1. **Defined a new virtual dressing (VD) task**: Motivated by merchants' needs, the task is to generate freely editable human images with fixed garments and optional conditions.
2. **Designed a comprehensive affinity metric index (CAMI)**: Evaluates the consistency between generated images and the reference garment.
3. **Proposed the IMAGDressing-v1 model**: Combines a trainable garment UNet with a frozen denoising UNet, and integrates garment features with text-prompt control through a hybrid attention module.
4. **Released a large-scale interactive garment pairing dataset (IGPair)**: Contains more than 300,000 pairs of clothing and dressed images to support community research.

### Solutions

- **IMAGDressing-v1 model**:
  - **Garment UNet**: Extracts semantic features from CLIP and texture features from the VAE to capture fine-grained garment features.
  - **Denoising UNet**: Kept frozen; receives garment features and text prompts through the hybrid attention module to achieve scene control.
  - **Hybrid attention module**: Combines a frozen self-attention with a trainable cross-attention to balance the influence of garment features and text prompts.
- **Dataset**:
  - **IGPair dataset**: Contains high-resolution images, diverse scenes and styles, and detailed text descriptions, meeting the requirements of the VD task.

### Experimental Results

- **Quantitative results**: IMAGDressing-v1 outperforms existing state-of-the-art methods on multiple evaluation metrics, notably on CAMI.
- **Qualitative results**: IMAGDressing-v1 follows text prompts faithfully while retaining fine-grained garment details, demonstrating strong performance on the VD task.

### Summary

This paper addresses the limitations of existing VTON technology in merchant applications by defining the new virtual dressing (VD) task and proposing the IMAGDressing-v1 model, which provides a more comprehensive and flexible garment display solution. The released IGPair dataset also provides a rich resource for related research.
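
The hybrid attention module described under Solutions can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes that the frozen self-attention and the trainable garment cross-attention share one query and that their outputs are summed, with a scalar weight on the garment branch. The names `HybridAttention` and `garment_scale`, the random weights, and the toy shapes are all hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    dim = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class HybridAttention:
    """Toy hybrid attention: frozen self-attention plus garment cross-attention.

    Assumption (not confirmed by the source): the two branches share the
    query projection and are combined additively with `garment_scale`.
    """

    def __init__(self, dim, garment_scale=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-ins for the pretrained denoising UNet weights (kept frozen).
        self.frozen = {n: random_matrix(dim, dim, rng) for n in ("q", "k", "v")}
        # Stand-ins for the new projections that would be trained.
        self.trainable = {n: random_matrix(dim, dim, rng) for n in ("k", "v")}
        self.garment_scale = garment_scale

    def __call__(self, hidden, garment):
        q = matmul(hidden, self.frozen["q"])
        # Branch 1: self-attention over the denoising UNet's own tokens.
        self_out = attention(
            q, matmul(hidden, self.frozen["k"]), matmul(hidden, self.frozen["v"])
        )
        # Branch 2: cross-attention from the same queries to garment tokens.
        cross_out = attention(
            q, matmul(garment, self.trainable["k"]), matmul(garment, self.trainable["v"])
        )
        return [
            [s + self.garment_scale * c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(self_out, cross_out)
        ]


# Toy inputs: 4 denoising-UNet tokens and 6 garment-UNet tokens, 8 channels each.
data_rng = random.Random(1)
hidden = random_matrix(4, 8, data_rng)
garment = random_matrix(6, 8, data_rng)

out = HybridAttention(dim=8, garment_scale=1.0)(hidden, garment)
baseline = HybridAttention(dim=8, garment_scale=0.0)(hidden, garment)
```

Setting `garment_scale` to 0 in this sketch reduces the layer to the frozen self-attention alone, so the garment branch acts as an additive adjustment on top of the pretrained behavior. That is one plausible reading of how a trainable branch can be attached to a frozen denoising UNet.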