Abstract: Recent developments in deep generative models have opened up a wide range of opportunities for image synthesis, leading to significant changes in various creative fields, including the fashion industry. While numerous methods have been proposed to benefit buyers, particularly in virtual try-on applications, there has been relatively less focus on facilitating fast prototyping for designers and customers seeking to order new designs. To address this gap, we introduce DiCTI (Diffusion-based Clothing Designer via Text-guided Input), a straightforward yet highly effective approach that allows designers to quickly visualize fashion-related ideas using text inputs only. Given an image of a person and a description of the desired garments as input, DiCTI automatically generates multiple high-resolution, photorealistic images that capture the expressed semantics. By leveraging a powerful diffusion-based inpainting model conditioned on text inputs, DiCTI is able to synthesize convincing, high-quality images with varied clothing designs that viably follow the provided text descriptions, while being able to process very diverse and challenging inputs, captured in completely unconstrained settings. We evaluate DiCTI in comprehensive experiments on two different datasets (VITON-HD and Fashionpedia) and in comparison to the state-of-the-art (SoTA). The results of our experiments show that DiCTI convincingly outperforms the SoTA competitor in generating higher quality images with more elaborate garments and superior text prompt adherence, both according to standard quantitative evaluation measures and human ratings, generated as part of a user study.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the difficulty designers and clients face when rapidly prototyping new, custom clothing designs. While many current methods focus on enhancing the consumer experience through applications like virtual try-on, less attention has been paid to the need for designers to quickly visualize fashion concepts. To fill this gap, the authors propose DiCTI (Diffusion-based Clothing Designer via Text-guided Input), which allows designers to quickly generate high-quality, realistic clothing images using only text input. Specifically, the goals of DiCTI are:

1. **Rapid Visualization of Fashion Concepts**: Designers can quickly generate multiple high-resolution, realistic clothing images from simple text descriptions.
2. **Improving Design Efficiency**: Automating the generation process reduces the time and effort designers spend in the initial design phase.
3. **Enhancing User Engagement**: Consumers can describe their needs through text, communicate more effectively with designers, or search for similar designs on the internet.

### Solution Overview

DiCTI uses a pre-trained diffusion model and text guidance to generate high-quality images from text descriptions. The main steps are:

1. **Mask Generation Module**: Generates binary masks for the body and face to guide subsequent image editing.
2. **Clothing Synthesis Module**: Generates new clothing designs in the masked areas based on text descriptions.
3. **Identity Preservation Module**: Ensures that facial features in the generated images remain consistent with the original image.

### Experiments and Results

The authors conducted comprehensive experiments on two different datasets (VITON-HD and Fashionpedia) and compared DiCTI with the existing state-of-the-art method, FICE. The experimental results show that DiCTI outperforms FICE in terms of image quality and consistency with the text description. Specifically, it excels in the following aspects:

- **Image Quality**: DiCTI generates more realistic and detailed images.
- **Text Description Consistency**: DiCTI follows text descriptions more closely when generating the requested clothing designs.
- **Identity Preservation**: DiCTI preserves facial features well, particularly skin tone consistency.

### Conclusion

DiCTI provides designers and consumers with an efficient and intuitive tool for quickly generating high-quality clothing design images. Through text-guided image editing, DiCTI not only improves design efficiency but also enhances user engagement and satisfaction.
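
As a rough illustration of how the three modules in the Solution Overview might fit together, the sketch below chains mask handling, text-guided inpainting, and identity preservation on NumPy arrays. It is not the authors' implementation: all function names are hypothetical, the way the body and face masks are combined is an assumption, and `fill_gray` is only a stand-in for a real text-conditioned diffusion inpainting model.

```python
from typing import Callable

import numpy as np

# Hypothetical interface: (image, edit_mask, prompt) -> edited image of the same shape.
Inpainter = Callable[[np.ndarray, np.ndarray, str], np.ndarray]


def build_edit_mask(body_mask: np.ndarray, face_mask: np.ndarray) -> np.ndarray:
    """Region to repaint: body pixels that are not face pixels (assumed rule)."""
    return body_mask.astype(bool) & ~face_mask.astype(bool)


def preserve_identity(
    original: np.ndarray, generated: np.ndarray, face_mask: np.ndarray
) -> np.ndarray:
    """Copy the original face pixels back over the generated image."""
    keep = face_mask.astype(bool)[..., None]  # add a channel axis for broadcasting
    return np.where(keep, original, generated)


def design_clothing(
    image: np.ndarray,
    body_mask: np.ndarray,
    face_mask: np.ndarray,
    prompt: str,
    inpaint: Inpainter,
) -> np.ndarray:
    """Run the three steps: build the mask, synthesize clothing, restore the face."""
    edit_mask = build_edit_mask(body_mask, face_mask)
    generated = inpaint(image, edit_mask, prompt)
    return preserve_identity(image, generated, face_mask)


def fill_gray(image: np.ndarray, edit_mask: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder inpainter: paints the masked region gray and ignores the prompt."""
    out = image.copy()
    out[edit_mask] = 128
    return out
```

In practice `fill_gray` would be replaced by a call to an actual inpainting model, and the masks would come from a segmentation or human-parsing model. Passing the inpainter in as a callable keeps the masking and compositing logic independent of whichever model is chosen.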

DiCTI: Diffusion-based Clothing Designer via Text-guided Input

Image Reference-guided Fashion Design with Structure-aware Transfer by Diffusion Models

FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion

TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

A Two-stage Personalized Virtual Try-on Framework with Shape Control and Texture Guidance

DressCode: Autoregressively Sewing and Generating Garments from Text Guidance

New Fashion: Personalized 3D Design with a Single Sketch Input

PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns

Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow

IMAGDressing-v1: Customizable Virtual Dressing

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Garment3DGen: 3D Garment Stylization and Texture Generation

Improving Diffusion Models for Virtual Try-on

DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment

Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models