WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on

Xujie Zhang, Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang
2023-12-07
Abstract: Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing Virtual Try-On (VITON) methods in generating high-quality, realistic try-on results. Current methods can warp a garment image to fit the pose of the target person, but they perform poorly in the following respects:

1. **Synthesis quality at the garment-skin boundary**: Existing methods often overlook detail where the garment meets the skin, resulting in poor visual quality in these areas.
2. **Realistic try-on effects**: Details such as wrinkles and shadows look unnatural in the synthesized results, which reduces overall realism.
3. **Consistency of visual quality**: Obvious artifacts or blurring occur in local areas (such as the garment-skin boundary or the neck), lowering the quality of the whole image.

To solve these problems, the paper proposes **WarpDiffusion**, which combines the advantages of explicit warping modules and diffusion models. By introducing a local texture attention mechanism and an auto-mask module, it improves both synthesis quality and resource efficiency.

The main contributions of WarpDiffusion include:

- **Efficient and high-fidelity VITON synthesis**: By combining explicit warping with a diffusion model, WarpDiffusion generates high-quality virtual try-on images while keeping resource consumption low.
- **Local texture attention mechanism**: This mechanism improves the synthesis quality of body regions through local feature attention and significantly reduces the training resources required.
- **Auto-mask module**: This module extracts informative garment features, producing more realistic try-on effects such as wrinkles and shadows while retaining garment details.
- **Extensive experimental verification**: Experiments on high-resolution VITON benchmarks and an in-the-wild test set show that WarpDiffusion outperforms existing state-of-the-art methods both qualitatively and quantitatively, and that it can be integrated as a plug-and-play module into existing VITON methods to improve their synthesis quality.

In conclusion, WarpDiffusion targets the difficulty existing VITON methods have in generating high-quality, realistic try-on images, with the goal of making virtual try-on more practical in fields such as e-commerce.
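The idea of attending only to reliable parts of the warped garment can be illustrated with a small, generic sketch of masked cross-attention. This is not the paper's architecture, which this summary does not specify. The function name `masked_attention`, the `keep_mask` argument, and the toy feature vectors are all hypothetical, and in WarpDiffusion the mask is described as being produced by an auto-mask module rather than supplied by hand.

```python
import math


def masked_attention(queries, keys, values, keep_mask):
    """Scaled dot-product attention over garment positions kept by keep_mask.

    queries:   list of person-side feature vectors
    keys:      list of garment-side feature vectors (one per garment position)
    values:    list of garment-side value vectors (same length as keys)
    keep_mask: list of bools; False marks a garment position to ignore
               (assumed here to flag unreliable warped regions)
    """
    if not any(keep_mask):
        raise ValueError("keep_mask must retain at least one garment position")

    dim = len(keys[0])
    kept_keys = [k for k, keep in zip(keys, keep_mask) if keep]
    kept_values = [v for v, keep in zip(values, keep_mask) if keep]

    outputs = []
    for q in queries:
        # Similarity between this query and every retained garment position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in kept_keys
        ]
        # Numerically stable softmax over the retained positions only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the retained garment values.
        out = [
            sum(w * v[j] for w, v in zip(weights, kept_values))
            for j in range(len(kept_values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy usage: the third garment position carries an outlier value and is masked out.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0], [2.0], [100.0]]
result = masked_attention(queries, keys, values, [True, True, False])
```

In this toy setup the masked position contributes nothing to the output, so `result[0][0]` is a blend of the first two values only. Restricting the mask further to a spatial neighborhood of each query would be one way to make the attention local, though that too is an assumption rather than something stated in the summary.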