Abstract: Text-to-4D generation has recently been demonstrated viable by integrating a 2D image diffusion model with a video diffusion model. However, existing models tend to produce results with inconsistent motions and geometric structures over time. To this end, we present a novel framework, coined CT4D, which directly operates on animatable meshes for generating consistent 4D content from arbitrary user-supplied prompts. The primary challenges of our mesh-based framework involve stably generating a mesh with details that align with the text prompt while directly driving it and maintaining surface continuity. Our CT4D framework incorporates a unique Generate-Refine-Animate (GRA) algorithm to enhance the creation of text-aligned meshes. To improve surface continuity, we divide a mesh into several smaller regions and implement a uniform driving function within each area. Additionally, we constrain the animating stage with a rigidity regulation to ensure cross-region continuity. Our experimental results, both qualitative and quantitative, demonstrate that our CT4D framework surpasses existing text-to-4D techniques in maintaining interframe consistency and preserving global geometry. Furthermore, we showcase that this enhanced representation inherently possesses the capability for combinational 4D generation and texture editing.

What problem does this paper attempt to address?

The paper addresses two shortcomings of existing text-to-4D generation methods: temporal inconsistency and geometric structure distortion.

1. **Temporal Inconsistency**: When existing methods generate dynamic scenes, motion is often inconsistent between frames. This degrades visual quality and shows up as jitter, most visibly around object edges.
2. **Geometric Structure Distortion**: Existing methods struggle to keep geometry stable once motion is added, so generated objects distort during the dynamic phase, especially at edges and in fine details.

To overcome these problems, the paper proposes CT4D (Consistent Text-to-4D Generation with Animatable Meshes), a framework built on animatable triangle meshes. It relies on three components:

- **Explicit Representation**: Animatable triangle meshes serve as the 4D representation. They explicitly separate geometric structure from texture, which helps keep geometry stable and motion temporally consistent.
- **Generate-Refine-Animate (GRA) Algorithm**: A three-stage algorithm first generates and then refines the geometry and texture so that the mesh aligns with the text prompt, and finally animates the resulting mesh.
- **Surface Continuity**: Vertex clustering divides the mesh into regions, each driven by a uniform driving function, and a rigidity regulation constrains the animation stage so that the surface stays continuous across regions and geometric distortion is avoided.

According to the paper's qualitative and quantitative experiments, CT4D outperforms existing text-to-4D generation methods in inter-frame consistency and in the stability of geometric structure.
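The region-wise driving and rigidity idea summarized above can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the summary does not say how vertices are clustered, what form the driving function takes, or how the rigidity regulation is defined. Here, k-means on vertex positions stands in for the clustering, one rigid transform per region stands in for the uniform driving function, and an edge-length preservation penalty stands in for the rigidity term. The function names `cluster_vertices`, `drive_regions`, and `rigidity_penalty` are hypothetical.

```python
import numpy as np

def cluster_vertices(vertices, num_regions, iters=10, seed=0):
    """Assumed clustering: plain k-means on vertex positions.
    Returns one region label per vertex."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centers never modifies vertices.
    centers = vertices[rng.choice(len(vertices), num_regions, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(vertices[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(num_regions):
            members = vertices[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels

def drive_regions(vertices, labels, rotations, translations):
    """Assumed driving function: every vertex in region k receives the
    same rigid transform (rotations[k], translations[k])."""
    R = rotations[labels]      # (V, 3, 3), one matrix per vertex
    t = translations[labels]   # (V, 3)
    return np.einsum("vij,vj->vi", R, vertices) + t

def rigidity_penalty(rest, deformed, edges):
    """Assumed rigidity term: mean squared change in edge length.
    Edges that cross region boundaries are the ones that can stretch."""
    rest_len = np.linalg.norm(rest[edges[:, 0]] - rest[edges[:, 1]], axis=1)
    def_len = np.linalg.norm(deformed[edges[:, 0]] - deformed[edges[:, 1]], axis=1)
    return float(np.mean((def_len - rest_len) ** 2))

# Toy example: four vertices on a line, split into two regions by hand.
rest = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
edges = np.array([[0, 1], [1, 2], [2, 3]])
labels = np.array([0, 0, 1, 1])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0, 0], [1.0, 0, 0]])  # region 1 shifts along x

deformed = drive_regions(rest, labels, rotations, translations)
penalty = rigidity_penalty(rest, deformed, edges)
```

In this toy case only the edge that crosses the region boundary stretches (from length 1 to length 2), so the penalty is 1/3. Within a region every edge keeps its length by construction. That is why a cross-region term is needed: minimizing such a penalty during animation would pull neighboring regions toward compatible motions.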

CT4D: Consistent Text-to-4D Generation with Animatable Meshes
