TryOffAnyone: Tiled Cloth Generation from a Dressed Person

Ioannis Xarchakos, Theodoros Koukopoulos
2024-12-12
Abstract: The fashion industry is increasingly leveraging computer vision and deep learning technologies to enhance online shopping experiences and operational efficiencies. In this paper, we address the challenge of generating high-fidelity tiled garment images essential for personalized recommendations, outfit composition, and virtual try-on systems from photos of garments worn by models. Inspired by the success of Latent Diffusion Models (LDMs) in image-to-image translation, we propose a novel approach utilizing a fine-tuned StableDiffusion model. Our method features a streamlined single-stage network design, which integrates garment-specific masks to isolate and process target clothing items effectively. By simplifying the network architecture through selective training of transformer blocks and removing unnecessary cross-attention layers, we significantly reduce computational complexity while achieving state-of-the-art performance on benchmark datasets like VITON-HD. Experimental results demonstrate the effectiveness of our approach in producing high-quality tiled garment images for both full-body and half-body inputs. Code and model are available at: https://github.com/ixarchakos/try-off-anyone
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the problem of generating high-quality tiled clothing images from photos of models wearing clothes. The authors propose a method that fine-tunes a StableDiffusion model to generate these images, which are needed for applications such as personalized recommendation, outfit composition, and virtual try-on systems.

#### Background and Challenges

As the fashion industry increasingly adopts computer vision and deep learning technologies to enhance the online shopping experience and operational efficiency, generating high-quality tiled clothing images has become particularly important. However, many online shopping platforms display only photos of models wearing clothes and lack tiled views, which limits the user experience. Obtaining additional tiled images is both expensive and time-consuming, which is a major obstacle for retailers.

#### Solutions

To solve this problem, the authors propose the following:

1. **Single-stage network design**: The network architecture is simplified by selectively training Transformer blocks and removing unnecessary cross-attention layers, which significantly reduces computational complexity.
2. **Clothing mask**: Garment-specific masks isolate the target clothing item for processing, improving generation quality.
3. **Pre-trained StableDiffusion base**: The pre-trained StableDiffusion v1.5 model is fine-tuned so that it is optimized for generating high-fidelity tiled clothing images.
4. **Fewer trainable parameters**: Training only the Transformer blocks in the U-Net reduces the trainable parameters from 815.45M to 267.24M (about 33% of the original count), which lowers memory requirements and computational cost.
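
The selective-training idea in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper `freeze_except`, the stand-in class `FakeParam`, and the toy parameter sizes are all hypothetical, and the `"transformer_blocks"` name filter is an assumption about how the U-Net's parameters are named. The helper only relies on each parameter exposing `requires_grad` and `numel()`, which PyTorch parameters do.

```python
from dataclasses import dataclass


def freeze_except(named_params, keyword="transformer_blocks"):
    """Enable gradients only for parameters whose name contains `keyword`.

    `named_params` is any iterable of (name, param) pairs where each param
    has a `requires_grad` attribute and a `numel()` method.
    Returns (trainable_count, total_count).
    """
    trainable = 0
    total = 0
    for name, param in named_params:
        param.requires_grad = keyword in name  # freeze everything else
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total


@dataclass
class FakeParam:
    """Hypothetical stand-in for a framework parameter (illustration only)."""
    size: int
    requires_grad: bool = True

    def numel(self):
        return self.size


# Toy parameter list; the names and sizes are made up for the example.
toy = [
    ("down_blocks.0.resnets.0.conv1.weight", FakeParam(600)),
    ("down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight", FakeParam(300)),
    ("mid_block.attentions.0.transformer_blocks.0.ff.net.2.weight", FakeParam(100)),
]

trainable, total = freeze_except(toy)
print(f"trainable: {trainable} of {total}")

# The ratio implied by the figures quoted in the summary (in millions).
reported_ratio = 267.24 / 815.45
print(f"reported trainable fraction: {reported_ratio:.3f}")
```

With a real PyTorch model, the same helper could be called as `freeze_except(unet.named_parameters())`, assuming `unet` is the loaded U-Net module; the optimizer would then be built only from parameters whose `requires_grad` is true.
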
#### Experimental Results

The experiments show that the method achieves state-of-the-art performance on the VITON-HD benchmark dataset and generates high-quality tiled clothing images from both full-body and half-body input images. The authors also conduct an ablation study to verify the method's effectiveness under different configurations, and they analyze how the number of seeds affects the quality and consistency of the generated images.

### Summary

In short, the paper addresses a key technical problem in the fashion industry by proposing an efficient method for generating high-quality tiled clothing images, which supports personalized recommendation, outfit composition, and virtual try-on systems.