Abstract: Text-to-3D generation has recently garnered significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation to leverage the 2D diffusion priors to supervise the generation of 3D models, e.g., NeRF. However, score distillation is prone to suffer the view inconsistency problem, and implicit NeRF modeling can also lead to an arbitrary shape, thus leading to less realistic and uncontrollable 3D generation. In this work, we propose a flexible framework of Points-to-3D to bridge the gap between sparse yet freely available 3D points and realistic shape-controllable 3D generation by distilling the knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide the text-to-3D generation. Specifically, we use the sparse point cloud generated from the 3D diffusion model, Point-E, as the geometric prior, conditioned on a single reference image. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss to adaptively drive the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we propose to optimize the NeRF for a more view-consistent appearance. To be specific, we perform score distillation to the publicly available 2D image diffusion model ControlNet, conditioned on text as well as depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D provides users with a new way to improve and control text-to-3D generation.
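
For readers unfamiliar with score distillation, the gradient it uses is commonly written in the form introduced with DreamFusion. The version below adds a depth condition $d$ to reflect the ControlNet setup the abstract describes. The notation is an assumption made for illustration and is not taken from this page:

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, d,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right]
$$

Here $\theta$ denotes the NeRF parameters, $x$ the rendered view, $x_t$ its noised version at timestep $t$, $\epsilon$ the injected Gaussian noise, $\hat{\epsilon}_\phi$ the noise predicted by the frozen diffusion model, $y$ the text prompt, $d$ the depth map rendered from the learned geometry, and $w(t)$ a timestep weighting. Dropping $d$ recovers the usual text-only form.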

What problem does this paper attempt to address?

This paper addresses two weaknesses of existing text-to-3D generation methods: view inconsistency and insufficient shape control.

1. **View Inconsistency**: Methods based on score distillation are prone to view inconsistency, meaning that the generated 3D model can look different when observed from different angles.
2. **Insufficient Shape Control**: Existing text-to-3D generation methods struggle to precisely control the shape of the generated 3D object, so the result may take an arbitrary shape that does not meet expectations.

To solve these problems, the paper proposes a new framework, **Points-to-3D**, which achieves more realistic and controllable text-to-3D generation by combining sparse 3D point clouds with knowledge distilled from pre-trained 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide text-to-3D generation, so that the generated 3D content is consistent across viewpoints and its shape can be controlled.

### Main Contributions

- A novel and flexible text-to-3D generation framework, Points-to-3D, which bridges the gap between sparse 3D points and more realistic, shape-controllable 3D generation by distilling the knowledge of pre-trained 2D and 3D diffusion models.
- To make full use of the sparse 3D points, an effective point cloud guidance loss is proposed to optimize the geometry of the NeRF, and the appearance is optimized in the compact latent space by using ControlNet through score distillation.
- Experimental results show that Points-to-3D can significantly reduce view inconsistency and achieve good control of 3D shapes.

Through these improvements, Points-to-3D provides a new way to improve and control text-to-3D generation.
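
The idea of a point cloud guidance loss can be illustrated with a minimal sketch. This is not the paper's actual loss, whose exact form is not given on this page. It is a hypothetical stand-in that assumes the NeRF exposes an occupancy value in [0, 1] at sampled 3D positions, and that positions within a fixed `radius` of any sparse point should be occupied. The function names, the `radius` threshold, and the binary cross-entropy form are all assumptions made for illustration.

```python
import math

def nearest_point_distance(query, points):
    """Euclidean distance from one 3D query position to its nearest cloud point."""
    return min(math.dist(query, p) for p in points)

def point_cloud_guidance_loss(occupancy, queries, points, radius=0.1, eps=1e-6):
    """Hypothetical guidance loss (illustrative only, not the paper's formulation).

    Builds a binary target per query (1 if within `radius` of the sparse
    point cloud, else 0) and returns the mean binary cross-entropy between
    the predicted occupancy and that target.
    """
    total = 0.0
    for occ, q in zip(occupancy, queries):
        target = 1.0 if nearest_point_distance(q, points) < radius else 0.0
        occ = min(max(occ, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= target * math.log(occ) + (1.0 - target) * math.log(1.0 - occ)
    return total / len(queries)

# Toy data: two sparse points, two queries near them, one query far away.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
queries = [(0.0, 0.0, 0.05), (1.0, 0.05, 0.0), (0.5, 0.5, 0.5)]

aligned = [0.9, 0.9, 0.1]   # occupied near the points, empty elsewhere
inverted = [0.1, 0.1, 0.9]  # the opposite geometry

loss_aligned = point_cloud_guidance_loss(aligned, queries, points)
loss_inverted = point_cloud_guidance_loss(inverted, queries, points)
```

In this toy setup, the occupancy that agrees with the sparse points receives a lower loss than the inverted one, which is the behavior a geometry-guidance term needs in order to pull the NeRF toward the point cloud's shape.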

Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

Control3D: Towards Controllable Text-to-3D Generation

Text-Free Controllable 3-D Point Cloud Generation

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation

Precise-Physics Driven Text-to-3D Generation

Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Diffusion-SDF: Text-to-Shape Via Voxelized Diffusion