D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, Yong Du
2024-07-21
Abstract: In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping, and reliance on static, annotation-driven clothing parsers. Additionally, we tackle the complexities in diffusion-based VTON models when handling simultaneous tasks like inpainting and denoising. Our approach utilizes two key technologies: Firstly, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, improving precise garment warping in a self-discovered manner. Secondly, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, thereby minimizing learning ambiguities and achieving realistic results with minimal overhead. Extensive experiments demonstrate that D$^4$-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, demonstrating its capability in generating realistic images and ensuring semantic consistency.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several key issues in image-based virtual try-on (VTON):

1. **Limitations in the Garment Deformation Stage**:
   - Current methods typically use Thin Plate Splines (TPS) or appearance flows for garment deformation. These techniques mainly focus on global alignment and neglect local semantic changes, which distorts texture patterns.
   - Some methods mitigate this issue by segmenting the garment into regions, but this approach relies on annotated data, which makes training time-consuming and makes appropriate semantic regions difficult to define.

2. **Complexity in the Synthesis Stage**:
   - Current methods usually employ Generative Adversarial Networks (GANs) or diffusion models for synthesis. GANs may produce unrealistic results, while diffusion models, although more stable, face optimization difficulties when they must handle denoising and inpainting simultaneously.
   - Existing methods often lack specific objectives for each of these tasks, which introduces ambiguity into what the synthesis network learns.

To address these issues, the paper proposes the D$^4$-VTON model, which combines dynamic semantics disentangling with a new differential diffusion paradigm to achieve precise garment deformation and high-quality synthesis. Specifically, D$^4$-VTON uses Dynamic Semantics Disentangling Modules (DSDMs) to learn distinct local flows and introduces a Differential Information Tracking Path (DITP) to separate the denoising and inpainting tasks, thereby reducing learning ambiguity and improving synthesis performance. The reported experiments show that D$^4$-VTON significantly outperforms existing methods on multiple benchmarks, in both quantitative and qualitative evaluations.
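To make the idea of "distinct local flows" more concrete, the following is a toy sketch of flow-based warping in which each group of feature channels is resampled with its own flow field, rather than one global flow being applied to everything. It is not the paper's implementation: the equal-sized channel grouping, the border clamping, and the names `bilinear_sample`, `warp`, and `warp_by_groups` are all assumptions made for illustration, and a real DSDM would predict the flows with a network instead of receiving them as inputs.

```python
from typing import List, Tuple

Grid = List[List[float]]                # one feature channel, H x W
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy) sampling offsets


def bilinear_sample(channel: Grid, x: float, y: float) -> float:
    """Sample `channel` at continuous (x, y), clamping to the border."""
    h, w = len(channel), len(channel[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = channel[y0][x0] * (1 - fx) + channel[y0][x1] * fx
    bottom = channel[y1][x0] * (1 - fx) + channel[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(channel: Grid, flow: Flow) -> Grid:
    """Backward warp: output[y][x] is `channel` sampled at (x + dx, y + dy)."""
    return [
        [bilinear_sample(channel, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(channel[0]))]
        for y in range(len(channel))
    ]


def warp_by_groups(channels: List[Grid], flows: List[Flow]) -> List[Grid]:
    """Split channels into len(flows) equal groups (an assumption of this
    sketch) and warp every channel with the flow of its own group."""
    if not flows or len(channels) % len(flows) != 0:
        raise ValueError("channel count must be a multiple of the flow count")
    group_size = len(channels) // len(flows)
    return [warp(ch, flows[i // group_size]) for i, ch in enumerate(channels)]


# Two channels, two local flows: the first group samples one pixel to the
# right, the second group is left untouched.
channel = [[0, 1, 2], [3, 4, 5]]
shift_right = [[(1.0, 0.0)] * 3 for _ in range(2)]
identity = [[(0.0, 0.0)] * 3 for _ in range(2)]
warped = warp_by_groups([channel, channel], [shift_right, identity])
print(warped[0])  # [[1.0, 2.0, 2.0], [4.0, 5.0, 5.0]]
```

A single global flow corresponds to `len(flows) == 1`. Using several flows lets different groups of features move differently, which is the intuition behind warping local garment semantics separately instead of aligning the garment only as a whole.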