AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He

2024-03-28

Abstract: Advances in 3D generation have facilitated sequential 3D model generation (a.k.a. 4D generation), yet its application to animatable objects with large motion remains scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects on skeletons extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which lifts 2D diffusion for temporally consistent 4D generation. CSD, designed from a score gradient perspective, generates a canonical model with warp-robustness across different articulations. Notably, it also enhances the authenticity of bones and skinning by integrating inductive priors from a diffusion model. Furthermore, with multi-view distillation, CSD infers invisible regions, thereby improving the fidelity of monocular non-rigid reconstruction. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from monocular video, while also showing improved reconstruction performance over existing non-rigid reconstruction methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper introduces AnimatableDreamer, a framework for generating and reconstructing non-rigid 3D models, with a focus on animatable objects that undergo large motion. The framework first extracts a skeleton and skinning from a monocular video, then generates text-prompted non-rigid 3D models on that skeleton, so that diverse object categories can be produced while remaining temporally consistent and robust to deformation.

The core contribution is an optimization design called Canonical Score Distillation (CSD). Designed from a score-gradient perspective, CSD generates a canonical model that stays robust when warped to different articulations, and it draws on the inductive priors of a diffusion model to make the bones and skinning more realistic. While maintaining temporal consistency, CSD optimizes the bones and skinning weights so that the model's shape remains plausible across joint poses. Through multi-view distillation, CSD also infers regions that are invisible in the input video, which improves the fidelity of monocular non-rigid reconstruction. According to the reported experiments, AnimatableDreamer generates highly flexible text-guided 3D models from monocular video and improves reconstruction performance over existing non-rigid reconstruction methods. The approach is relevant to the automatic construction of animatable 3D models in fields such as gaming, virtual reality, and film visual effects.
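To make the role of bones and skinning weights concrete, here is a minimal 2D sketch of how a canonical point can be warped to a posed position. It assumes plain linear blend skinning, a common choice for articulated models; the summary above does not say which warping model the paper uses, so this is an illustration rather than the authors' method. The names `rotation_2d`, `apply_bone`, `warp_point`, and the two-bone rig are hypothetical.

```python
import math

def rotation_2d(angle_rad):
    """Return a 2x2 rotation matrix as nested lists."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s], [s, c]]

def apply_bone(point, bone):
    """Apply one rigid bone transform: rotate about the bone's pivot, then translate."""
    (r00, r01), (r10, r11) = bone["rotation"]
    px, py = bone["pivot"]
    tx, ty = bone["translation"]
    dx, dy = point[0] - px, point[1] - py
    return (r00 * dx + r01 * dy + px + tx,
            r10 * dx + r11 * dy + py + ty)

def warp_point(point, bones, weights):
    """Linear blend skinning (assumed here): weighted average of per-bone positions."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("skinning weights must have a positive sum")
    x = y = 0.0
    for bone, w in zip(bones, weights):
        bx, by = apply_bone(point, bone)
        # Normalize so the weights act as a convex combination.
        x += (w / total) * bx
        y += (w / total) * by
    return (x, y)

# Hypothetical two-bone rig: one static bone, one rotated 90 degrees about the origin.
bones = [
    {"rotation": rotation_2d(0.0), "pivot": (0.0, 0.0), "translation": (0.0, 0.0)},
    {"rotation": rotation_2d(math.pi / 2), "pivot": (0.0, 0.0), "translation": (0.0, 0.0)},
]
canonical_point = (1.0, 0.0)

# A point influenced equally by both bones lands between the two rigid results.
warped = warp_point(canonical_point, bones, [0.5, 0.5])
```

In this toy setup the skinning weights alone decide where the canonical point ends up for a given pose, which is why poorly chosen weights can produce implausible shapes at some articulations. Under the stated assumption, this is the kind of quantity the summary describes CSD as optimizing.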

DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

DreamMapping: High-Fidelity Text-to-3D Generation via Variational Distribution Mapping

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution

DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

4Dynamic: Text-to-4D Generation with Hybrid Priors

OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control

StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D

VividDreamer: Invariant Score Distillation For Hyper-Realistic Text-to-3D Generation

Retrieval-Augmented Score Distillation for Text-to-3D Generation

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation