Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu
Abstract: 3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an undetachable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as underwear, outerwear, trousers, shoes, etc. In this work, we propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process. Specifically, HumanLiff first generates minimal-clothed humans, represented by tri-plane features, in a canonical space, and then progressively generates clothes in a layer-wise manner. In this way, 3D human generation is formulated as a sequence of diffusion-based 3D conditional generation. To reconstruct more fine-grained 3D humans with tri-plane representation, we propose a tri-plane shift operation that splits each tri-plane into three sub-planes and shifts these sub-planes to enable feature grid subdivision. To further enhance the controllability of 3D generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane features and 3D layered conditions to facilitate the 3D diffusion model learning. Extensive experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), validate that HumanLiff significantly outperforms state-of-the-art methods in layer-wise 3D human generation. Our code will be available at https://skhu101.github.io/HumanLiff.
What problem does this paper attempt to address?
This paper addresses how to control the generation of individual clothing layers when generating clothed 3D humans. Existing 3D human generative models usually produce a fully clothed human as a single, undetachable model in one pass, and rarely consider the layered nature of clothing (underwear, outerwear, trousers, shoes, and so on). This single-pass approach is limiting when users want to control each layer of the generation process. For example, in virtual reality (VR) or augmented reality (AR) applications, users may want to create game characters layer by layer: first generate a minimally clothed body, then progressively select or generate trousers, tops, and shoes.
To solve this problem, the paper proposes **HumanLiff**, the first layer-wise 3D human generative model built on a unified diffusion process. Its main contributions are as follows:
1. **Layer-wise 3D human generation**: HumanLiff first generates a minimally clothed human in a canonical space and then generates each clothing layer in turn, with every step being a diffusion-based 3D generation conditioned on the previous layer. This lets users control the generation of the body and of each clothing layer.
2. **Tri-plane representation and tri-plane shift operation**: Each 3D human is represented by tri-plane features, which encode 3D space with three mutually orthogonal feature planes. To reconstruct finer detail with this representation, the paper proposes a tri-plane shift operation that splits each plane into three sub-planes and shifts them. As a result, 3D points that project onto the same grid cell can extract different features, which effectively subdivides the feature grid and improves the detail the model can express.
3. **3D conditional fusion**: To better control 3D generation, HumanLiff uses a 3D conditional UNet encoder to extract multi-scale 3D condition features and fuses them, layer by layer, with the outputs of the diffusion UNet decoder, so that information from the previous clothing layer is retained during generation.
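The tri-plane lookup and shift described in item 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `bilinear_sample` and `query_triplane`, the channel-wise split into sub-planes, the shift offsets (thirds of a grid cell), and the concatenation of features are all assumptions chosen for illustration.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (C, H, W) feature plane at coords uv in [-1, 1]^2."""
    C, H, W = plane.shape
    # Map [-1, 1] to pixel coordinates and clamp at the border.
    x = np.clip((uv[:, 0] + 1) * 0.5 * (W - 1), 0, W - 1)
    y = np.clip((uv[:, 1] + 1) * 0.5 * (H - 1), 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x1] * wx
    bot = plane[:, y1, x0] * (1 - wx) + plane[:, y1, x1] * wx
    return (top * (1 - wy) + bot * wy).T  # (N, C)

def query_triplane(planes, points, shifts=None):
    """Project (N, 3) points onto the xy, xz, yz planes and gather features.

    If `shifts` is given (hypothetical 2D offsets), each plane is split
    channel-wise into len(shifts) sub-planes, and each sub-plane is sampled
    at a shifted location, so nearby points read different feature cells.
    """
    axes = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    feats = []
    for name, (a, b) in axes.items():
        uv = points[:, [a, b]]
        if shifts is None:
            feats.append(bilinear_sample(planes[name], uv))
        else:
            subs = np.split(planes[name], len(shifts), axis=0)
            for sub, s in zip(subs, shifts):
                feats.append(bilinear_sample(sub, uv + s))
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
C, R = 6, 8  # channels per plane (divisible by 3) and plane resolution
planes = {k: rng.standard_normal((C, R, R)) for k in ("xy", "xz", "yz")}
points = rng.uniform(-1, 1, size=(4, 3))

cell = 2.0 / (R - 1)  # size of one grid cell in normalized coordinates
# Assumed offsets: 0, 1/3 and 2/3 of a cell along both plane axes.
shifts = [np.full(2, k * cell / 3) for k in range(3)]

plain = query_triplane(planes, points)
shifted = query_triplane(planes, points, shifts)
print(plain.shape, shifted.shape)  # (4, 18) (4, 18)
```

In this sketch the per-point feature size is unchanged by the shift (three planes of six channels versus nine sub-planes of two channels), while the sub-planes sample at offset positions, which is the intuition behind feature grid subdivision.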
With these contributions, experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), show that HumanLiff significantly outperforms existing 3D GAN and diffusion-based methods on layer-wise 3D human generation. Beyond advancing 3D human generation, the approach opens new possibilities for personalized and interactive 3D content creation in practical applications.