Abstract: Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short given the high diversity of clothing shapes and textures. Furthermore, the generation process should be intuitively controllable for lay users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given text describing the shapes of clothes, the human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with further attributes describing the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level holds the structural representations of textures, while the codebook at the fine level focuses on texture details. To use the learned hierarchical codebook to synthesize the desired images, a diffusion-based transformer sampler with mixture of experts is first employed to sample indices from the coarsest level of the codebook; these indices are then used to predict the indices of the codebook at finer levels. The predicted indices at the different levels are translated to human images by the decoder learned together with the hierarchical codebooks. The mixture of experts allows the generated image to be conditioned on fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed Text2Human framework generates more diverse and realistic human images than state-of-the-art methods.
Our project page is https://yumingj.github.io/projects/Text2Human.html. Code and pretrained models are available at https://github.com/yumingj/Text2Human.
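
The coarse-to-fine codebook lookup described in the abstract can be illustrated with a toy sketch. This is not the Text2Human implementation: the names `CODEBOOKS`, `quantize`, and `decode` and the two-dimensional toy vectors are hypothetical, and the fine level here quantizes the residual left by the coarse level, which is an assumed stand-in for the multi-scale representations the abstract mentions. It shows only the general idea of choosing a codebook by texture type and representing a feature with one coarse index and one fine index.

```python
import math

def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def quantize(vector, texture, codebooks):
    """Map `vector` to (coarse_idx, fine_idx) using the codebooks of one texture type.

    Assumption: the fine level encodes the residual left after the coarse level.
    """
    levels = codebooks[texture]
    coarse_idx = nearest_index(vector, levels["coarse"])
    coarse = levels["coarse"][coarse_idx]
    residual = [v - c for v, c in zip(vector, coarse)]
    fine_idx = nearest_index(residual, levels["fine"])
    return coarse_idx, fine_idx

def decode(indices, texture, codebooks):
    """Rebuild an approximate vector by summing the coarse and fine entries."""
    coarse_idx, fine_idx = indices
    levels = codebooks[texture]
    coarse = levels["coarse"][coarse_idx]
    fine = levels["fine"][fine_idx]
    return [c + f for c, f in zip(coarse, fine)]

# Hypothetical toy codebooks: one coarse and one fine table per texture type.
CODEBOOKS = {
    "floral": {"coarse": [[0.0, 0.0], [1.0, 1.0]],
               "fine": [[0.0, 0.0], [0.25, -0.25]]},
    "stripe": {"coarse": [[0.0, 1.0], [1.0, 0.0]],
               "fine": [[0.0, 0.0], [-0.25, 0.25]]},
}

indices = quantize([1.2, 0.8], "floral", CODEBOOKS)
reconstruction = decode(indices, "floral", CODEBOOKS)
print(indices, reconstruction)
```

In the framework described above, the indices are not obtained by quantizing a known feature at generation time; a sampler predicts the coarse indices from the text and parsing inputs, and the finer indices are predicted from those. The sketch covers only the lookup and decoding side.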