FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan
2024-05-19
Abstract: Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the loss of clothing detail in Virtual Try-On (VTON). Existing methods based on Latent Diffusion Models (LDMs) can generate realistic, high-resolution try-on images, but they preserve clothing details such as style, pattern, and text poorly. The authors attribute this to two causes: the stochastic nature of the diffusion process and the reliance on latent supervision. To mitigate these issues, they propose a Faithful Latent Diffusion Model for Virtual Try-On (FLDM-VTON), with three main improvements:

1. **Using warped clothes as the starting point and a local condition**: The warped clothes supply faithful prior information about the garment, which reduces the randomness of the sampling process.
2. **Introducing a clothes flattening network**: A new network constrains the generated try-on images, enforcing clothing consistency and adding image-level supervision on top of the latent supervision.
3. **Designing a clothes-posterior sampling method**: At inference time, the conventional clothes-agnostic Gaussian sampling is replaced with a sampling strategy informed by the clothes, further improving performance.

Experimental results on the VITON-HD and Dress Code datasets show that FLDM-VTON outperforms state-of-the-art baselines, generating photo-realistic try-on images while keeping clothing details faithful.
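To make the first and third improvements more concrete, the sketch below contrasts a clothes-agnostic Gaussian starting point with one centred on the warped-clothes latent. It is only an illustration under stated assumptions, not the authors' implementation: it assumes the starting latent is drawn with the standard forward-diffusion formula applied to the warped-clothes latent, and the names `linear_alpha_bar`, `gaussian_start`, and `clothes_prior_start`, the linear noise schedule, and the latent shape are all hypothetical. The paper's exact formulation may differ.

```python
import numpy as np


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def gaussian_start(shape, rng):
    """Conventional clothes-agnostic start: pure standard Gaussian noise."""
    return rng.standard_normal(shape)


def clothes_prior_start(warped_latent, alpha_bar_t, rng):
    """Hypothetical clothes-aware start.

    Noises the warped-clothes latent with the standard forward-diffusion
    formula, so the sample is centred on the clothes rather than on zero:
        z_t = sqrt(alpha_bar_t) * z_clothes + sqrt(1 - alpha_bar_t) * eps
    """
    noise = rng.standard_normal(warped_latent.shape)
    signal = np.sqrt(alpha_bar_t) * warped_latent
    return signal + np.sqrt(1.0 - alpha_bar_t) * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar()

    # Stand-in for an encoded warped-clothes image (shape is an assumption).
    warped_latent = rng.standard_normal((4, 64, 48))

    z_gauss = gaussian_start(warped_latent.shape, rng)
    z_clothes = clothes_prior_start(warped_latent, alpha_bar[-1], rng)
    print(z_gauss.shape, z_clothes.shape)
```

With this schedule the final-step coefficient is very small, so the two starting points are nearly indistinguishable at the last timestep. A clothes-centred start carries noticeably more garment information only if sampling begins from a less noisy timestep or the schedule retains more signal; which of these, if either, the authors use is not stated in the summary above.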