Abstract: In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the problem of efficiently generating high-quality 3D character models from a single image. Specifically, the paper tackles the following challenges:

1. **Complex Body Poses**: 3D character models typically include complex joint structures, leading to frequent self-occlusion in 2D images, which significantly complicates reconstruction, generation, and animation.
2. **Pose Ambiguity and Diversity**: Characters may take on various body poses, including some rare and difficult-to-interpret poses, resulting in a diverse but imbalanced data domain, further increasing the difficulty of generation, rigging, and animation.
3. **Limitations of Existing Methods**: Existing 3D generation techniques (such as those based on parametric models) are mainly suitable for realistic human proportions and tight clothing. These methods have limited effectiveness and adaptability for stylized characters with exaggerated body proportions and complex clothing designs.

### Solution

To address the above issues, the paper proposes the **CharacterGen** framework, whose core innovations include:

1. **Multi-View Diffusion Model**: By aligning the input pose to a standard pose (e.g., "A-pose") while retaining key attributes of the input image, it effectively addresses the challenges posed by different poses.
2. **Transformer-Based Sparse View Reconstruction Model**: Utilizes multi-view images to generate detailed 3D models, simplifying the reconstruction process of geometry and texture.
3. **Texture Back-Projection Strategy**: Generates high-quality texture maps, ensuring the visual quality of the final model.

### Main Contributions

1. **Multi-View Consistent Image Generation Model**: Proposes a diffusion model conditioned on images that can generate multi-view consistent standard pose images from different input poses, addressing self-occlusion and pose ambiguity issues.
2. **Efficient 3D Reconstruction Pipeline**: Combines the multi-view image generation model and the transformer-based reconstruction model to achieve efficient conversion from single-view input to detailed 3D character models.
3. **Large-Scale Dataset**: Constructs a multi-view, multi-pose dataset (Anime3D) containing 13,746 anime characters, providing rich resources for model training and evaluation.

### Summary

The CharacterGen framework effectively addresses the challenges of generating high-quality 3D character models from a single image by introducing a multi-view diffusion model and a transformer-based reconstruction model. It is particularly suitable for the generation, rigging, and animation of stylized characters.
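
The texture back-projection step listed under Solution can be illustrated with a small, generic sketch. This is not the paper's implementation: the function name `backproject_colors`, the orthographic cameras, the convention that each camera looks down its own -z axis, and the cosine-based blending weights are all assumptions chosen to keep the example short. Occlusion testing, which a real pipeline would need, is left out.

```python
import numpy as np

def backproject_colors(points, normals, views):
    """Blend per-point colors from several orthographic views.

    Hypothetical helper for illustration only.
    points:  (N, 3) surface points in world space, coordinates in [-1, 1]
    normals: (N, 3) unit surface normals in world space
    views:   list of (R, image) pairs; R is a 3x3 world-to-camera rotation,
             image is an (H, W, 3) float array. Each camera is assumed to
             look down its own -z axis.
    Returns (colors, visible): (N, 3) blended colors and an (N,) bool mask.
    """
    colors = np.zeros((len(points), 3))
    weights = np.zeros(len(points))
    for R, image in views:
        h, w = image.shape[:2]
        cam_pts = points @ R.T    # points in camera space
        cam_nrm = normals @ R.T   # normals in camera space
        # Weight = cosine between the normal and the direction to the camera (+z).
        # Back-facing points get weight 0. No occlusion test is performed.
        facing = np.clip(cam_nrm[:, 2], 0.0, None)
        # Orthographic projection: map x, y in [-1, 1] to pixel indices.
        u = np.clip(np.round((cam_pts[:, 0] + 1) * 0.5 * (w - 1)).astype(int), 0, w - 1)
        v = np.clip(np.round((1 - cam_pts[:, 1]) * 0.5 * (h - 1)).astype(int), 0, h - 1)
        colors += facing[:, None] * image[v, u]
        weights += facing
    visible = weights > 0
    # Normalize only where at least one view saw the point.
    colors[visible] = colors[visible] / weights[visible][:, None]
    return colors, visible
```

Points that no view faces keep a zero color and are flagged as not visible, so a caller could fill them from another source, for example a coarse texture predicted by the reconstruction model.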

CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

Single Image, Any Face: Generalisable 3D Face Generation

Make-A-Character: High Quality Text-to-3D Character Generation within Minutes

AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

Full-body High-resolution Anime Generation with Progressive Structure-conditional Generative Adversarial Networks

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization

Guide3D: Create 3D Avatars from Text and Image Guidance

Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

HybridAvatar: Efficient Mesh-based Human Avatar Generation from Few-Shot Monocular Images with Implicit Mesh Displacement

XAGen: 3D Expressive Human Avatars Generation

$E^{3}$Gen: Efficient, Expressive and Editable Avatars Generation

GETAvatar: Generative Textured Meshes for Animatable Human Avatars

En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Dynamic facial asset and rig generation from a single scan

Learning Full-Head 3D GANs from a Single-View Portrait Dataset

CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images