Abstract: We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
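To put the quoted scales side by side, the short calculation below works out how many voxels a 100m extent needs at a 10cm voxel size, and how many cells a dense grid of the quoted finest resolution would hold. It is only back-of-the-envelope arithmetic: the helper names are hypothetical, and rounding the side length up to a power of two is an assumption made for illustration, not something the abstract states.

```python
def voxels_per_side(extent_cm: int, voxel_cm: int) -> int:
    """Voxels needed along one axis (ceiling division, in integer centimeters)."""
    return -(-extent_cm // voxel_cm)


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


# 100 m extent at 10 cm voxels, expressed in centimeters to avoid float rounding.
side = voxels_per_side(100 * 100, 10)
# Assumption: pad the side length to a power of two.
padded = next_power_of_two(side)
# Cells a *dense* grid of that resolution would contain.
dense_cells = padded ** 3

print(side, padded, dense_cells)  # prints: 1000 1024 1073741824
```

A dense $1024^3$ grid would hold roughly a billion cells, while the abstract speaks of generating millions of voxels, which is the gap a sparse representation is meant to exploit.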

What problem does this paper attempt to address?

This paper addresses several key challenges in large-scale 3D generative modeling:

1. **High-resolution 3D generation**: Existing 3D generative models are limited in resolution, especially on large outdoor scenes, and usually reach only relatively low resolutions (such as $128^3$). This restricts their use in fields such as autonomous driving and robotics, which need high-resolution detail to represent complex geometry accurately. The proposed method generates 3D voxel grids at resolutions up to $1024^3$.
2. **Multi-attribute generation**: Besides 3D geometry, many applications need additional per-voxel attributes, such as normals, semantic labels, and truncated signed distance function (TSDF) values, for downstream processing and analysis. The proposed method assigns multiple attributes to the voxel grid as it generates it, supporting a wider range of applications.
3. **Efficient generation process**: Traditional 3D generation methods often require time-consuming test-time optimization, which makes generation slow and computationally expensive. By using a sparse voxel hierarchy and a custom, efficient 3D deep-learning framework built on the VDB data structure, the proposed method generates complex shapes containing millions of voxels within 30 seconds.
4. **User-guided editing**: In practice, users may need to edit generated 3D scenes to meet specific requirements. The proposed method supports multi-scale user-guided editing: users modify coarse-level voxels to control the finer-level 3D shape, enabling flexible interactive editing.
5. **Large-scale scene generation**: Existing 3D generative models perform poorly on large scenes, especially large outdoor scenes. Experiments on the Waymo Open Dataset and the Karton City dataset demonstrate the method's effectiveness in generating large-scale, high-resolution scenes.

In summary, the paper proposes a new 3D generative model, XCube, to address the limitations of existing 3D generative models in resolution, multi-attribute generation, generation efficiency, user-guided editing, and large-scale scene generation, thereby broadening the fields in which 3D generation can be applied.
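The coarse-to-fine idea behind a sparse voxel hierarchy can be illustrated with a toy sketch. This is not XCube's model or its VDB-based framework: the learned diffusion step is replaced here by a hand-written geometric rule (a thin spherical shell), plain Python sets stand in for the sparse data structure, and all function names are hypothetical. The point is only that each level refines the occupied voxels of the previous one, so storage grows with the surface rather than with the full volume.

```python
from itertools import product


def subdivide(coarse, keep):
    """Split every occupied coarse voxel into 8 children; keep those `keep` accepts."""
    fine = set()
    for x, y, z in coarse:
        for dx, dy, dz in product((0, 1), repeat=3):
            child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
            if keep(child):
                fine.add(child)
    return fine


def sphere_shell(res):
    """Stand-in occupancy rule (assumption): voxel centers within one voxel of a sphere surface."""
    center, radius = res / 2, 0.4 * res

    def keep(voxel):
        dist = sum((a + 0.5 - center) ** 2 for a in voxel) ** 0.5
        return abs(dist - radius) <= 1.0

    return keep


def build_hierarchy(start_res, levels):
    """Return [(resolution, occupied voxel set), ...] from coarse to fine."""
    keep = sphere_shell(start_res)
    # Only the coarsest level is scanned densely; finer levels visit children only.
    grid = {v for v in product(range(start_res), repeat=3) if keep(v)}
    hierarchy = [(start_res, grid)]
    res = start_res
    for _ in range(levels):
        res *= 2
        grid = subdivide(grid, sphere_shell(res))
        hierarchy.append((res, grid))
    return hierarchy


if __name__ == "__main__":
    for res, grid in build_hierarchy(16, 2):
        print(f"{res}^3 grid: {len(grid)} occupied of {res ** 3} dense cells")
```

Editing a coarse level in this sketch (removing or adding entries before calling `subdivide`) changes which fine voxels can exist at all, which is a rough analogue of the multi-scale editing described in item 4.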

XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

Adaptive voxels: interactive rendering of massive 3D models

GaussianCube: A Structured and Explicit Radiance Representation for 3D Generative Modeling

VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models

Neural Volumetric Mesh Generator

HyperCube: Implicit Field Representations of Voxelized 3D Models

MeshXL: Neural Coordinate Field for Generative 3D Foundation Models

Structured 3D Latents for Scalable and Versatile 3D Generation

Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering

Pushing the Limits of 3D Shape Generation at Scale

Cubixel: a novel paradigm in image processing using three-dimensional pixel representation

A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets

Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata

GALA: Geometry-Aware Local Adaptive Grids for Detailed 3D Generation

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

OctFusion: Octree-based Diffusion Models for 3D Shape Generation

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Self-supervised novel 2D view synthesis of large-scale scenes with efficient multi-scale voxel carving