Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
2024-09-13
Abstract: Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty is that the diffusion process should not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.
Computer Vision and Pattern Recognition, Multimedia