Abstract: Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning textual and garment information is not sufficient to capture the semantics of the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. In its first stage, ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook; in its second stage, it uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals. Contrary to prior approaches that also rely on two-stage paradigms, ARMANI introduces textual tokens into the codebook, making it possible for the model to utilize fine-grained semantic information to generate more realistic images. Further, by introducing a cross-modal Transformer, ARMANI is versatile and can accomplish image synthesis from various control signals, such as pure text, sketch images, and partial images. Extensive experiments conducted on our newly collected cross-modal fashion dataset demonstrate that ARMANI generates photo-realistic images in diverse synthesis tasks and outperforms existing state-of-the-art cross-modal image synthesis approaches. Our code is available at https://github.com/Harvey594/ARMANI.
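The two-stage paradigm mentioned in the abstract can be illustrated with a minimal sketch. This is not ARMANI's implementation: it shows only generic nearest-neighbor codebook quantization (stage one) and the concatenation of control-signal tokens with image tokens into a single sequence that an autoregressive Transformer could model (stage two). The function names `quantize` and `build_sequence`, the toy codebook, and the use of a vocabulary offset to keep the two token types apart are all assumptions made for illustration; in particular, the sketch does not reproduce the learned cross-modal codebook or MaskCLIP.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: array of shape (N, D), e.g. encoder outputs for N image patches.
    codebook: array of shape (K, D), K code vectors (hypothetical, toy-sized here).
    Returns an integer array of shape (N,) holding one token id per feature.
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def build_sequence(control_tokens, image_tokens, image_offset):
    """Concatenate control tokens and image tokens into one token sequence.

    image_offset shifts image token ids so they do not collide with control
    token ids (an assumption of this sketch, not a detail from the paper).
    """
    control_tokens = np.asarray(control_tokens)
    image_tokens = np.asarray(image_tokens)
    return np.concatenate([control_tokens, image_tokens + image_offset])


# Toy data: three 2-D "patch features" and a three-entry codebook.
toy_codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_features = np.array([[0.1, -0.1], [4.6, 5.2], [0.9, 1.2]])

toy_image_tokens = quantize(toy_features, toy_codebook)
toy_sequence = build_sequence([7, 3], toy_image_tokens, image_offset=10)
```

In this sketch the second stage would be trained to predict the image-token part of `toy_sequence` given the control-token prefix; the Transformer itself is omitted.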

Collocated Clothing Synthesis with GANs Aided by Textual Information: A Multi-Modal Framework

BC-GAN: A Generative Adversarial Network for Synthesizing a Batch of Collocated Clothing

Learning to Synthesize Compatible Fashion Items Using Semantic Alignment and Collocation Classification: An Outfit Generation Framework

Clothing Generation by Multi-Modal Embedding: A Compatibility Matrix-Regularized GAN Model

Collocating Clothes with Generative Adversarial Networks Cosupervised by Categories and Attributes: A Multidiscriminator Framework

COutfitGAN: Learning to Synthesize Compatible Outfits Supervised by Silhouette Masks and Fashion Styles

Toward AI fashion design: An Attribute-GAN model for clothing match

Towards Intelligent Design: A Self-Driven Framework for Collocated Clothing Synthesis Leveraging Fashion Styles and Textures

MGCM: Multi-modal Generative Compatibility Modeling for Clothing Matching

Toward Multi-Modal Conditioned Fashion Image Translation

Poly-GAN: Multi-Conditioned GAN for Fashion Synthesis

Learning to Disentangle the Colors, Textures, and Shapes of Fashion Items: A Unified Framework

Spatially Constrained GAN for Face and Fashion Synthesis

Multi-Garment Customized Model Generation

GAN-Based Garment Generation Using Sewing Pattern Images

ClothingOut: a category-supervised GAN model for clothing segmentation and retrieval

Pose-Normalized and Appearance-Preserved Street-to-Shop Clothing Image Generation and Feature Learning

CascadeGAN: A Category-Supervised Cascading Generative Adversarial Network for Clothes Translation from the Human Body to Tiled Images

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

FCBoost-Net: A Generative Network for Synthesizing Multiple Collocated Outfits Via Fashion Compatibility Boosting

Learning to Synthesize Fashion Textures