Abstract: Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are guaranteed to be continuous and manifold. The key components of OctFusion are the octree-based latent representation and the accompanying diffusion models. The representation combines the benefits of both implicit neural representations and explicit spatial octrees and is learned with an octree-based variational autoencoder. The proposed diffusion model is a unified multi-scale U-Net that enables weights and computation sharing across different octree levels and avoids the complexity of widely used cascaded diffusion schemes. We verify the effectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve state-of-the-art performance on shape generation tasks. We demonstrate that OctFusion is extendable and flexible by generating high-quality color fields for textured mesh generation and high-quality 3D shapes conditioned on text prompts, sketches, or category labels. Our code and pre-trained models are available at https://github.com/octree-nn/octfusion.

What problem does this paper attempt to address?

The paper addresses how to efficiently generate high-quality, high-resolution 3D shapes. Existing diffusion models face two main challenges when generating 3D shapes: how to represent 3D shapes efficiently, and how to train diffusion models on those representations. Current methods either generate slowly or produce low-resolution shapes. To address this, the paper proposes OctFusion, an octree-based diffusion model that can efficiently generate 3D shapes at arbitrary resolution. The main contributions of OctFusion are:

1. **Octree-based implicit representation**: A 3D shape is represented as a volumetric octree with an implicit feature attached to each leaf node. These features are decoded into local signed distance fields (SDFs) by a shared multi-layer perceptron (MLP) and then fused into a global SDF by a multi-level partition of unity (MPU) module. The representation combines the advantages of implicit representations and explicit spatial octrees: it can represent continuous fields and express complex geometric and texture details.
2. **Unified multi-scale diffusion model**: A single multi-scale U-Net shares weights and computation across different octree levels. This significantly reduces the number of parameters and the training complexity, and lets the model generate detailed 3D shapes efficiently.
3. **Efficient generation**: OctFusion can generate high-quality 3D shapes within 2.5 seconds on a single Nvidia 4090 GPU, and the generated meshes are guaranteed to be continuous and manifold.
4. **Wide applicability**: OctFusion supports unconditional generation as well as conditional generation from text prompts, sketches, or category labels.

In experiments on the ShapeNet and Objaverse datasets, OctFusion achieved state-of-the-art performance on 3D shape generation tasks with only 33M trainable parameters. The generated implicit fields are guaranteed to be continuous and can be converted into meshes of any resolution. OctFusion also excels at generating textured 3D shapes.
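
The fusion of per-leaf local SDFs into one continuous global SDF can be illustrated with a minimal 1D sketch of partition-of-unity blending. This is not the OctFusion implementation. The tent-shaped weight function, the single-level 1D cell layout, and the names `tent_weight` and `blend_local_sdfs` are assumptions made for illustration, and simple linear functions stand in for the outputs of the shared MLP.

```python
import numpy as np

def tent_weight(x, center, radius):
    """Assumed weight: 1 at the cell center, falling linearly to 0 at `radius`."""
    return np.clip(1.0 - np.abs(x - center) / radius, 0.0, None)

def blend_local_sdfs(x, centers, radii, local_sdfs):
    """Partition-of-unity blend: sum_i w_i(x) * f_i(x) / sum_i w_i(x)."""
    x = np.asarray(x, dtype=float)
    num = np.zeros_like(x)
    den = np.zeros_like(x)
    for c, r, f in zip(centers, radii, local_sdfs):
        w = tent_weight(x, c, r)
        num += w * f(x)   # weighted local SDF value
        den += w          # running sum of weights for normalization
    return num / np.maximum(den, 1e-12)

# Four hypothetical leaf cells covering [0, 1]; supports overlap neighbors.
centers = [0.125, 0.375, 0.625, 0.875]
radii = [0.25] * 4

# Stand-ins for decoded local SDFs: the same surface at x = 0.3,
# with a small per-cell disagreement to mimic imperfect local predictions.
offsets = [0.00, 0.01, -0.01, 0.00]
local_sdfs = [lambda x, b=b: (x - 0.3) + b for b in offsets]

xs = np.linspace(0.0, 1.0, 101)
global_sdf = blend_local_sdfs(xs, centers, radii, local_sdfs)
```

Because the normalized weights are non-negative and sum to one, the blended value at every point is a convex combination of the overlapping local predictions. When the local SDFs agree, the blend reproduces them exactly; when they disagree slightly, the result still varies continuously across cell boundaries, which is the property that allows a mesh to be extracted at any sampling resolution.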

OctFusion: Octree-based Diffusion Models for 3D Shape Generation

NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Part-aware Shape Generation with Latent 3D Diffusion of Neural Voxel Fields

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation

Diffusion-SDF: Text-to-Shape Via Voxelized Diffusion

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

LION: Latent Point Diffusion Models for 3D Shape Generation

Deformable 3D Shape Diffusion Model

Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models

Locally Attentional SDF Diffusion for Controllable 3D Shape Generation

Topology-Aware Latent Diffusion for 3D Shape Generation

Generating Images with 3D Annotations Using Diffusion Models

MeshDiffusion: Score-based Generative 3D Mesh Modeling

ShapeFusion: A 3D diffusion model for localized shape editing

Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion