Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
Abstract: Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces. Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate the part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (AP) and the visual garment parts which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention to preserve the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows for manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
What problem does this paper attempt to address?
The paper aims to achieve fine-grained, part-level semantic alignment in cross-modal garment synthesis and manipulation. When generating garment images, existing methods struggle to align each part named in the input text with the generated image, which leads to two main problems:
1. **Garment Part Leakage**: One or more clothing parts mentioned in the text description are missing in the generated image. For example, when generating a jacket with pockets, the model may not generate the pockets.
2. **Attribute Confusion**: In the generated image, attributes and clothing parts are wrongly paired, or some attributes are ignored. For example, a prompt for a "blue shirt" may produce a shirt with a "brown collar", or a "pure white" attribute may be ignored in a striped shirt.
The root cause of these problems is that existing methods ignore the structural correspondence between visual and textual representations. To overcome them, the paper proposes **DiffCloth**, a diffusion-based framework for cross-modal garment synthesis and manipulation that improves the accuracy of generation and manipulation through structured semantic alignment.
### Main Contributions
1. **Structural Semantic Consensus Guidance**:
- A structured semantic alignment method is proposed, which models the structural correspondence between visual and text representations as a bipartite graph matching problem and uses the Hungarian algorithm for optimization.
- The Hungarian matching loss \( L_{\text{Hungarian}} \) is introduced to guide the diffusion model to maintain the structural consistency between the image and the text during the generation process.
2. **Semantic-bundled Cross-attention**:
- A new semantic-bundled cross-attention mechanism is proposed. It mitigates attribute confusion by minimizing the spatial structure difference between the attention maps of attribute adjectives and clothing part nouns.
- The semantic-bundled loss \( L_{\text{bundle}} \) is defined to maintain the spatial structure similarity between attributes and parts during the generation process.
3. **Region Consistency Mechanism**:
- A region consistency mechanism is introduced to prevent the modification of regions unrelated to the text description during the manipulation process.
- A blended mask derived from the attention maps is used to identify and protect regions unrelated to the manipulation, keeping edits local and accurate.
### Method Overview
#### 3.1 Preliminaries
**Stable Diffusion** is the basis of DiffCloth and consists of an auto-encoder and a diffusion model. The auto-encoder encodes the image into a low-resolution latent representation. The diffusion model gradually adds noise to this latent until it approximately follows a normal distribution, and then generates an image by reversing the process through denoising.
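
For reference, the noising step and the denoising objective are commonly written as follows in the standard latent-diffusion formulation. The symbols are assumptions taken from that general literature, not from this summary: \( z_0 \) is the encoded latent, \( \bar{\alpha}_t \) the cumulative noise schedule, \( c \) the text conditioning, and \( \epsilon_\theta \) the denoising network.

\[
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\]

\[
L_{\text{LDM}} = \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right]
\]

As \( \bar{\alpha}_t \to 0 \) the latent \( z_t \) approaches pure Gaussian noise, which is the "normal distribution" referred to above.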
#### 3.2 Structural Semantic Consensus Guidance
- **Visual Structure Components**: A segmenter is used to divide the clothing image into multiple parts, such as sleeves, body, hat, etc.
- **Text Structure Components**: All Attribute-Phrases (APs) are extracted through constituency parsing, such as "blue sweater", "classic hood", "long sleeves", etc.
- **Bipartite Graph Matching**: The semantic alignment problem between visual and text components is modeled as a bipartite graph matching problem. The Hungarian algorithm is used to find the optimal match and calculate the Hungarian matching loss \( L_{\text{Hungarian}} \).
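
The matching step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the similarity scores, the choice of `1 - similarity` as the cost, and the names `hungarian` and `matching_cost` are all assumptions made for the example. How the matched pairs are turned into \( L_{\text{Hungarian}} \) is not specified in this summary, so the sketch simply sums the costs of the matched pairs.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Classic O(n^2 * m) Hungarian algorithm with row/column potentials.
    Returns a dict {row_index: column_index}.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-indexed)
    v = [0.0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers for the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:       # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}


def matching_cost(cost, assignment):
    """Sum of costs over matched (row, column) pairs."""
    return sum(cost[r][c] for r, c in assignment.items())


# Hypothetical similarity scores between APs (rows) and garment parts (columns).
phrases = ["blue sweater", "classic hood", "long sleeves"]
parts = ["sleeves", "body", "hood"]
similarity = [
    [0.2, 0.9, 0.1],
    [0.1, 0.2, 0.8],
    [0.7, 0.3, 0.1],
]
cost = [[1.0 - s for s in row] for row in similarity]  # assumed cost definition
assignment = hungarian(cost)
pairs = {phrases[r]: parts[c] for r, c in assignment.items()}
```

With these made-up scores, `pairs` maps "blue sweater" to "body", "classic hood" to "hood", and "long sleeves" to "sleeves". In practice a library routine such as SciPy's `linear_sum_assignment` would usually replace the hand-written solver.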
#### 3.3 Semantic-bundled Cross-attention
- **Attribute Phrase Extraction**: Attribute phrases are extracted from the input text.
- **Attention Map Similarity**: The semantic-bundled loss \( L_{\text{bundle}} \) penalizes the spatial structure difference between the attention maps of the attribute adjectives and the clothing part noun within each AP.
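
To make the idea concrete, here is a small sketch of a loss that pulls an adjective's attention map toward its noun's attention map. The summary does not give the exact form of \( L_{\text{bundle}} \), so the distance used here (L1 distance between sum-normalized maps, averaged over phrases) is an assumption, as are the names `normalize`, `map_distance` and `bundle_loss` and the toy 2x2 maps.

```python
def normalize(attn):
    """Scale a 2D attention map so its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has zero total activation")
    return [[value / total for value in row] for row in attn]


def map_distance(attn_a, attn_b):
    """L1 distance between two normalized maps (assumed measure; range 0..2)."""
    norm_a, norm_b = normalize(attn_a), normalize(attn_b)
    return sum(
        abs(x - y)
        for row_a, row_b in zip(norm_a, norm_b)
        for x, y in zip(row_a, row_b)
    )


def bundle_loss(attn_maps, phrases):
    """Average adjective-to-noun map distance over all (adjective, noun) pairs."""
    if not phrases:
        return 0.0
    total = sum(map_distance(attn_maps[adj], attn_maps[noun]) for adj, noun in phrases)
    return total / len(phrases)


# Toy cross-attention maps keyed by token (hypothetical values).
attn_maps = {
    "blue": [[0, 1], [0, 1]],
    "sweater": [[0, 2], [0, 2]],   # same spatial layout as "blue"
    "long": [[1, 0], [0, 0]],
    "sleeves": [[0, 0], [0, 1]],   # attends to a different region than "long"
}
aligned = bundle_loss(attn_maps, [("blue", "sweater")])
confused = bundle_loss(attn_maps, [("long", "sleeves")])
```

The aligned pair scores zero because only the spatial layout matters after normalization, while the pair whose maps focus on different regions receives the maximum penalty under this measure.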
#### 3.4 Region Consistency Mechanism
- **Dynamic Threshold**: The first quartile of the pixel activation values in the attention map is selected as the dynamic threshold to generate a binary mask.
- **Blended Mask**: A blended mask is used to identify and protect regions unrelated to the manipulation, ensuring...
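
The thresholding step can be sketched as follows. This is only an illustration under stated assumptions: pixels strictly above the first quartile are treated as manipulation-relevant, a single attention map stands in for the bundled maps of an AP, and the per-pixel copy in `blend` is a stand-in for whatever blending the paper applies. The function names `quartile_mask` and `blend` are hypothetical.

```python
from statistics import quantiles


def quartile_mask(attn):
    """Binary mask: 1 where activation exceeds the map's first quartile."""
    flat = [value for row in attn for value in row]
    threshold = quantiles(flat, n=4, method="inclusive")[0]  # first quartile
    return [[1 if value > threshold else 0 for value in row] for row in attn]


def blend(original, edited, mask):
    """Take edited values inside the mask and keep original values elsewhere."""
    return [
        [e if m else o for o, e, m in zip(row_o, row_e, row_m)]
        for row_o, row_e, row_m in zip(original, edited, mask)
    ]


# Toy 3x3 attention map with activations 0..8 (hypothetical values).
attn = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
mask = quartile_mask(attn)
original = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
edited = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
result = blend(original, edited, mask)
```

Because the threshold is computed from each map's own activations, it adapts to maps with different overall intensity, which is what makes it "dynamic".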