Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow

Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang
DOI: https://doi.org/10.1145/3581783.3612255
2023-08-11
Abstract: Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve the local details of the clothes. We then combine the warped clothes with the clothes-agnostic person image and add noise as the input of the diffusion model. Additionally, the warped clothes are used as local conditions for each denoising process to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses high-quality image synthesis for virtual try-on. The goal of virtual try-on is to transfer clothes from one image onto a person in another image while preserving the details of both the clothes and the human body. Many existing methods rely on Generative Adversarial Networks (GANs), but these still show shortcomings at high resolutions, such as loss of detail and lack of realism.

Diffusion models have recently emerged as an alternative for generating high-quality images. However, directly using the clothes as a condition to guide a diffusion model's inpainting is insufficient to maintain the details of the clothes. To overcome this, the authors propose an exemplar-based inpainting method in which a warping module guides the diffusion model's generation:

- The warping module first processes the clothes, which helps retain their local details.
- The warped clothes are combined with the clothes-agnostic person image, and noise is added to form the input to the diffusion model.
- The warped clothes are also used as a local condition in each denoising step, so that the output retains as much detail as possible.

The main contributions of the paper include:

1. **Proposing a new framework**: Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), which leverages the generative capabilities of diffusion models.
2. **Introducing the warping module**: The warping module preprocesses the clothes so that their local details are preserved, which supports high-quality, realistic results.
3. **Experimental validation**: Experimental results on the VITON-HD dataset demonstrate the effectiveness and superiority of the proposed method.
Through these innovations, the authors hope to generate high-quality and realistic composite images in the virtual try-on task, especially in high-resolution scenarios.
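
The data flow described above (paste the warped clothes into the clothes-agnostic person image, add noise, then condition each denoising step on the warped clothes) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: images are flat lists of pixel values, the function names `compose_input`, `add_noise`, and `denoiser_input` are hypothetical, the noising step is the standard DDPM forward formula, and passing the condition alongside the noisy image (standing in for channel concatenation) is an assumed mechanism that the summary does not specify.

```python
import math
import random


def compose_input(warped, agnostic, mask):
    """Paste warped-clothes pixels into the clothes-agnostic person image.

    mask is 1.0 where the warped clothes should appear and 0.0 elsewhere.
    """
    return [m * w + (1.0 - m) * a for w, a, m in zip(warped, agnostic, mask)]


def add_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward step (assumed here):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


def denoiser_input(x_t, warped, mask):
    """Bundle the noisy image with the local condition for one denoising step.

    The list stands in for channel concatenation, which is an assumption.
    """
    return [x_t, warped, mask]


# Toy four-pixel example.
rng = random.Random(0)
warped = [0.9, 0.8, 0.0, 0.0]    # warped clothes (zeros outside the garment)
agnostic = [0.0, 0.0, 0.3, 0.4]  # person image with the clothing region removed
mask = [1.0, 1.0, 0.0, 0.0]      # garment region

x0 = compose_input(warped, agnostic, mask)
x_t = add_noise(x0, alpha_bar_t=0.5, rng=rng)
step_input = denoiser_input(x_t, warped, mask)
```

In this sketch the warped clothes enter twice: once inside the noised starting image and once as a per-step condition, which mirrors the two roles the summary assigns to the warping module's output.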