Controllable Human Image Generation with Personalized Multi-Garments

Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin
2024-11-25
Abstract: We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in the human image and the extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses controllable human image generation conditioned on multiple reference garments. Existing methods face three challenges when generating an image of a person wearing several reference garments:

1. **Difficulty in data acquisition**: Large-scale, high-quality datasets that pair each person with images of every garment they wear are very hard to collect. Ideally, one would manually gather a photograph of each garment worn by each person, which is nearly impossible in practice.
2. **Low generation quality**: Existing methods often show copy-paste artifacts: the generated clothing is reproduced exactly as it appears in the reference image instead of being adapted to the person's pose or appearance. Subjects can also blend together or become inconsistent, especially when generating human images with diverse poses.
3. **Limited generalization ability**: Existing models often fail to produce natural, coherent images for unusual garment combinations, such as a swimsuit worn with football shoes, and they do not retain fine details accurately.

To address these problems, the paper proposes the BootComp framework, which improves the quality and diversity of generated images through a two-stage approach: synthetic data generation followed by training of a combination module. The main contributions of BootComp are:

- **Synthetic data generation**: A decomposition module extracts product-view garment images from images of a single person wearing those garments, and the resulting synthetic pairs are used to train the generation model. To ensure data quality, a filtering strategy based on perceptual similarity removes low-quality generated data.
- **Combination module**: Two diffusion models are used, one as an image encoder that extracts garment features and the other as a generator that creates the human image. Because the generator is kept frozen, BootComp can be easily integrated with other modules, enabling applications such as pose control, stylized generation, and personalized generation.

With this framework, BootComp generates high-quality human images wearing multiple garments and shows broad potential in applications such as virtual try-on and personalized generation.
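The similarity-based filtering step in the synthetic data pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the `SyntheticPair` container, the `filter_pairs` helper, the use of cosine similarity over precomputed feature vectors, and the 0.8 threshold are all assumptions made for illustration. In practice, the features would come from a perceptual similarity model applied to the garment region of the human image and to the extracted garment image.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class SyntheticPair:
    """Hypothetical container for one synthetic (human, garment) sample."""
    pair_id: str
    worn_garment_feat: Vector       # features of the garment as worn in the human image
    extracted_garment_feat: Vector  # features of the generated product-view garment


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Stand-in similarity score; a learned perceptual metric would replace this."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def filter_pairs(
    pairs: List[SyntheticPair],
    threshold: float = 0.8,  # assumed value, not taken from the paper
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[SyntheticPair]:
    """Keep only pairs whose extracted garment resembles the worn garment."""
    return [
        p for p in pairs
        if similarity(p.worn_garment_feat, p.extracted_garment_feat) >= threshold
    ]


if __name__ == "__main__":
    pairs = [
        SyntheticPair("good", [1.0, 0.0], [1.0, 0.0]),  # identical features
        SyntheticPair("bad", [1.0, 0.0], [0.0, 1.0]),   # unrelated features
    ]
    kept = filter_pairs(pairs, threshold=0.8)
    print([p.pair_id for p in kept])  # prints ['good']
```

Passing the similarity function as a parameter keeps the filtering logic independent of whichever perceptual metric is actually used, so only the scorer and the threshold would need to change.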