Abstract: The creation of 4D avatars (i.e., animated 3D avatars) from text descriptions typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently animates them with target motions. However, such an optimization-by-animation paradigm has several drawbacks: (1) for pose-agnostic optimization, the images rendered in the canonical pose for naive Score Distillation Sampling (SDS) exhibit a domain gap and cannot preserve view consistency using only T2I priors, and (2) for post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and the target avatar, and corrects the mismatched source motion using pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. Quantitative and qualitative experiments demonstrate that STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies show the contribution of each component in STAR. The source code and demos are available at: https://star-avatar.github.io

What problem does this paper attempt to address?

This paper addresses text-based 4D avatar generation, that is, producing animated 3D avatars from textual descriptions. Existing methods usually use text-to-image (T2I) diffusion models to synthesize a 3D avatar in the canonical space and then animate it with target motions. This approach has two drawbacks: 1) for pose-agnostic optimization, the images rendered in the fixed canonical pose for naive Score Distillation Sampling (SDS) exhibit a domain gap, and view consistency cannot be maintained with T2I priors alone; 2) for post hoc animation, simply applying source motions to the target 3D avatar produces translation artifacts and misalignment. To address these issues, the paper proposes STAR (Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting). STAR takes into account the geometric and skeletal differences between the template mesh and the target avatar, and uses pretrained motion retargeting techniques to correct the mismatched source motion. Using the retargeted, occlusion-aware skeleton, STAR combines skeleton-conditioned T2I and text-to-video (T2V) priors in a hybrid SDS module that provides multi-view and frame-consistent supervision signals, so that geometry, texture, and motion are optimized progressively in an end-to-end manner. The experimental results demonstrate that STAR can generate high-quality 4D avatars with vivid animations that align well with the text description. The paper also reports ablation studies showing the contribution of each component of STAR, and provides source code and demos. In summary, the paper targets view consistency, animation quality, and skeleton-geometry mismatch in text-based 4D avatar generation, and addresses them by integrating motion retargeting with hybrid SDS.
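The idea of a hybrid SDS signal can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the weighted sum of an image-prior term and a video-prior term, the names `hybrid_sds_grad`, `t2i_denoiser`, `t2v_denoiser`, `lam_img`, `lam_vid`, and the weighting w(t) = 1 - alpha_bar are all assumptions made for illustration. Only the basic SDS form, w(t) * (predicted noise - injected noise), follows the standard definition. Text and skeleton conditioning are assumed to live inside the denoiser callables.

```python
import numpy as np

def sds_grad(noise_pred, noise, weight):
    """Standard SDS gradient w.r.t. the rendered pixels: w(t) * (eps_hat - eps)."""
    return weight * (noise_pred - noise)

def hybrid_sds_grad(frames, t2i_denoiser, t2v_denoiser, alpha_bar,
                    lam_img=1.0, lam_vid=1.0, rng=None):
    """Combine an image-prior and a video-prior SDS gradient (illustrative only).

    frames:        rendered clip, shape (F, H, W, C)
    t2i_denoiser:  hypothetical callable, noisy frame (H, W, C) -> predicted noise
    t2v_denoiser:  hypothetical callable, noisy clip (F, H, W, C) -> predicted noise
    alpha_bar:     cumulative noise-schedule value in (0, 1) at the sampled timestep
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(frames.shape)
    # Forward diffusion: mix the rendered frames with Gaussian noise.
    noisy = np.sqrt(alpha_bar) * frames + np.sqrt(1.0 - alpha_bar) * noise
    weight = 1.0 - alpha_bar  # assumed weighting w(t)

    # Image prior scores each frame independently.
    eps_img = np.stack([t2i_denoiser(f) for f in noisy])
    # Video prior scores the whole clip jointly, which is what ties frames together.
    eps_vid = t2v_denoiser(noisy)

    return (lam_img * sds_grad(eps_img, noise, weight)
            + lam_vid * sds_grad(eps_vid, noise, weight))
```

In a real pipeline the returned array would be back-propagated through a differentiable renderer into the avatar's geometry, texture, and motion parameters; that step is omitted here.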

STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

TADA! Text to Animatable Digital Avatars

AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

HeadSculpt: Crafting 3D Head Avatars with Text

Barbie: Text to Barbie-Style 3D Avatars

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

SEEAvatar: Photorealistic Text-to-3D Avatar Generation with Constrained Geometry and Appearance

AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

Disentangled Clothed Avatar Generation from Text Descriptions

HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

TextToon: Real-Time Text Toonify Head Avatar from Single Video

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose

AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations