FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion

Abhishek Kumar Singh, Ioannis Patras
2024-04-26
Abstract: The rapid evolution of the fashion industry increasingly intersects with technological advancements, particularly through the integration of generative AI. This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. Utilizing ControlNet and LoRA fine-tuning, our approach generates high-quality images from multimodal inputs such as text and sketches. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data. Our evaluation, utilizing metrics like FID, CLIP Score, and KID, demonstrates that our model significantly outperforms traditional stable diffusion models. The results not only highlight the effectiveness of our model in generating fashion-appropriate outputs but also underscore the potential of diffusion models in revolutionizing fashion design workflows. This research paves the way for more interactive, personalized, and technologically enriched methodologies in fashion design and representation, bridging the gap between creative vision and practical application.
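To make the FID metric named in the abstract concrete, the following is a minimal sketch of the Fréchet distance between two sets of image features. It assumes the features have already been extracted (in standard FID they come from an Inception-v3 network); the function name `frechet_distance` and the random stand-in features are hypothetical and are not taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in features (assumption): real use would pass network activations
# for the real and the generated garment images.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))
fake_feats = rng.normal(loc=0.5, size=(200, 4))
print(frechet_distance(real_feats, fake_feats))
```

Lower values mean the two feature distributions are closer, which is why a lower FID is read as more realistic generation.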
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of helping fashion designers turn their ideas into images. The researchers introduce a pipeline built on a latent diffusion model, combined with ControlNet and LoRA fine-tuning, that generates high-quality images from text descriptions and sketches. They use the virtual try-on datasets Multimodal Dress Code and VITON-HD, and extend both to include sketches. Through comparative experiments evaluated with FID, CLIP Score, and KID, the paper shows that the proposed model outperforms traditional stable diffusion models at generating detailed, realistic clothing images that match the input conditions. The approach could make fashion design more interactive and personalized, and it is suited to applications such as automated design.
In summary, the main contributions of the paper are:
1. Implementation of a novel pipeline based on stable diffusion, LoRA, and ControlNet for fashion garment generation guided by multimodal inputs such as text and sketches.
2. Introduction of a new generative model tailored for fashion designers, which uses a latent diffusion model for conditional modeling.
3. Expansion of the virtual try-on datasets with sketch information, and a proposed new evaluation metric that measures the structural similarity between generated images and input sketches.
This work builds on existing research in text-to-image synthesis, sketch-based image generation, and ControlNet, and applies these advances to the fashion design industry.
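The summary does not say how the proposed structural-similarity metric is defined. As an illustration only, one way to compare a generated image with its input sketch is to extract an edge map from the image and score it against the sketch with SSIM. The sketch below uses a simplified single-window (global) SSIM instead of the usual sliding-window version; the helper names `edge_map` and `global_ssim` and the toy images are assumptions, not the paper's method.

```python
import numpy as np


def edge_map(img):
    """Gradient-magnitude edge map of a 2-D grayscale image, scaled to [0, 1]."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag


def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (one window, no sliding)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)


# Toy data (assumption): a filled square stands in for a generated garment,
# and its own outline stands in for the designer's input sketch.
generated = np.zeros((32, 32))
generated[8:24, 8:24] = 1.0
sketch = edge_map(generated)

unrelated = np.zeros((32, 32))
unrelated[2:10, 20:30] = 1.0

print(global_ssim(edge_map(generated), sketch))
print(global_ssim(edge_map(unrelated), sketch))
```

A generated image whose outline follows the sketch should score close to 1, while an image with a different layout should score noticeably lower.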