CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

Hao-Yang Peng, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu
2024-07-10
Abstract: In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the problem of efficiently generating high-quality 3D character models from a single image. Specifically, the paper tackles the following challenges:

1. **Complex Body Poses**: 3D character models typically include complex joint structures, leading to frequent self-occlusion in 2D images, which significantly complicates reconstruction, generation, and animation.
2. **Pose Ambiguity and Diversity**: Characters may take on various body poses, including some rare and difficult-to-interpret poses, resulting in a diverse but imbalanced data domain, further increasing the difficulty of generation, rigging, and animation.
3. **Limitations of Existing Methods**: Existing 3D generation techniques (such as those based on parametric models) are mainly suitable for realistic human proportions and tight clothing. These methods have limited effectiveness and adaptability for stylized characters with exaggerated body proportions and complex clothing designs.

### Solution

To address the above issues, the paper proposes the **CharacterGen** framework, whose core innovations include:

1. **Multi-View Diffusion Model**: By aligning the input pose to a standard pose (e.g., "A-pose") while retaining key attributes of the input image, it effectively addresses the challenges posed by different poses.
2. **Transformer-Based Sparse View Reconstruction Model**: Utilizes multi-view images to generate detailed 3D models, simplifying the reconstruction process of geometry and texture.
3. **Texture Back-Projection Strategy**: Generates high-quality texture maps, ensuring the visual quality of the final model.

### Main Contributions

1. **Multi-View Consistent Image Generation Model**: Proposes a diffusion model conditioned on images that can generate multi-view consistent standard pose images from different input poses, addressing self-occlusion and pose ambiguity issues.
2. **Efficient 3D Reconstruction Pipeline**: Combines the multi-view image generation model and the transformer-based reconstruction model to achieve efficient conversion from single-view input to detailed 3D character models.
3. **Large-Scale Dataset**: Constructs a multi-view, multi-pose dataset (Anime3D) containing 13,746 anime characters, providing rich resources for model training and evaluation.

### Summary

The CharacterGen framework effectively addresses the challenges of generating high-quality 3D character models from a single image by introducing a multi-view diffusion model and a transformer-based reconstruction model. It is particularly suitable for the generation, rigging, and animation of stylized characters.
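
### Illustrative Sketch: Texture Back-Projection

The summary names a texture back-projection strategy but does not describe how it works. As a rough illustration of the general idea only, the Python sketch below projects surface points into several views, samples a color from each view, and blends the samples with a weight based on how directly the surface faces the camera. This is a generic technique, not CharacterGen's implementation: the function name `back_project_colors`, its parameters, the pinhole camera model, the nearest-pixel sampling, and the cosine weighting are all assumptions made for this example, and occlusion testing is deliberately omitted.

```python
import numpy as np


def back_project_colors(points, normals, images, cam_positions, projections):
    """Blend per-point colors from several views (generic sketch, hypothetical API).

    points:        (N, 3) float array of surface points in world space
    normals:       (N, 3) float array of unit normals at those points
    images:        list of (H, W, 3) float arrays, one per view
    cam_positions: list of (3,) camera centers in world space
    projections:   list of (3, 4) matrices mapping homogeneous world points to pixels
    Returns (colors, visible): (N, 3) blended colors and an (N,) boolean mask.
    """
    n = points.shape[0]
    color_sum = np.zeros((n, 3))
    weight_sum = np.zeros(n)
    homog = np.hstack([points, np.ones((n, 1))])  # (N, 4) homogeneous coordinates

    for img, cam, proj_mat in zip(images, cam_positions, projections):
        h, w = img.shape[:2]
        proj = homog @ proj_mat.T  # (N, 3): columns are (u * depth, v * depth, depth)
        depth = proj[:, 2]
        in_front = depth > 1e-8
        safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by zero

        # Nearest-pixel lookup; a real pipeline would likely interpolate.
        ui = np.rint(proj[:, 0] / safe_depth).astype(int)
        vi = np.rint(proj[:, 1] / safe_depth).astype(int)
        inside = in_front & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)

        # Assumed weighting: cosine between the normal and the direction to the camera.
        # Back-facing points get weight zero. No occlusion test is performed here.
        to_cam = cam - points
        to_cam = to_cam / np.linalg.norm(to_cam, axis=1, keepdims=True)
        facing = np.clip(np.sum(normals * to_cam, axis=1), 0.0, None)
        weight = np.where(inside, facing, 0.0)

        # Clamp indices so the lookup is always valid; out-of-image points have weight 0.
        ui = np.clip(ui, 0, w - 1)
        vi = np.clip(vi, 0, h - 1)
        color_sum += weight[:, None] * img[vi, ui]
        weight_sum += weight

    visible = weight_sum > 0
    colors = np.zeros((n, 3))
    colors[visible] = color_sum[visible] / weight_sum[visible][:, None]
    return colors, visible
```

In a setting like the one the summary describes, `images` would correspond to the canonical-pose multi-view images and `points` to samples on the reconstructed surface. Points that no view sees (`visible` is `False`) would need a separate fill step, which this sketch leaves out.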