XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams
2024-06-26
Abstract: We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key challenges in large-scale 3D generative modeling:

1. **High-resolution 3D generation**: Existing 3D generative models are limited in resolution, typically reaching only about $128^3$, which is too coarse for large outdoor scenes. This restricts their use in fields such as autonomous driving and robotics, where fine detail is needed to represent complex geometry accurately. The proposed method generates 3D voxel grids at resolutions up to $1024^3$.
2. **Multi-attribute generation**: Beyond 3D geometry, many applications also need per-voxel attributes such as normals, semantic labels, and truncated signed distance functions (TSDF), which downstream processing and analysis rely on. The proposed method assigns multiple attributes to the voxel grid as it generates it, supporting a wider range of applications.
3. **Efficient generation process**: Traditional 3D generation methods often require time-consuming test-time optimization, which makes generation slow and computationally expensive. By combining a sparse voxel hierarchy with a custom, efficient 3D deep-learning framework built on the VDB data structure, the proposed method generates complex shapes containing millions of voxels within 30 seconds.
4. **User-guided editing**: In practice, users may need to edit generated 3D scenes to meet specific requirements. The proposed method supports multi-scale user-guided editing: modifying coarse-level voxels steers the finer-level shape generated from them, which enables flexible interactive editing.
5. **Large-scale scene generation**: Existing 3D generative models perform poorly on large-scale scenes, especially large outdoor scenes. Experiments on the Waymo Open Dataset and the Karton City dataset demonstrate the method's effectiveness, and its advantage over prior approaches, in generating large-scale, high-resolution scenes.

In summary, the paper proposes a new 3D generative model, XCube, to overcome the limitations of existing 3D generative models in resolution, multi-attribute generation, generation efficiency, user-guided editing, and large-scale scene generation, thereby broadening the range of fields in which 3D generation can be applied.
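
The coarse-to-fine sparse voxel hierarchy described above can be illustrated with a toy sketch. This is not the XCube implementation. It rests on two assumptions made purely for illustration: plain Python sets stand in for the VDB data structure, and a hand-written geometric test (`near_sphere_surface`, a hypothetical helper) stands in for the learned diffusion model that decides which voxels exist at the next level. The names `subdivide` and `build_hierarchy` are likewise invented here. The sketch only shows why refining active voxels level by level avoids touching the full dense grid, which at $1024^3$ would contain 1,073,741,824 cells.

```python
from typing import Callable, List, Set, Tuple

Voxel = Tuple[int, int, int]


def near_sphere_surface(voxel: Voxel, res: int) -> bool:
    """Hypothetical stand-in for a learned occupancy decision.

    Returns True when the voxel (in a unit cube split into res^3 cells)
    may intersect a sphere surface of radius 0.4 centered in the cube.
    """
    h = 1.0 / res                                      # voxel edge length
    cx, cy, cz = ((c + 0.5) * h - 0.5 for c in voxel)  # voxel center
    dist = (cx * cx + cy * cy + cz * cz) ** 0.5
    half_diagonal = h * 3 ** 0.5 / 2                   # conservative bound
    return abs(dist - 0.4) <= half_diagonal


def subdivide(coarse: Set[Voxel], fine_res: int,
              keep: Callable[[Voxel, int], bool]) -> Set[Voxel]:
    """Split each active coarse voxel into 8 children and keep the accepted ones."""
    fine: Set[Voxel] = set()
    for x, y, z in coarse:
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
                    if keep(child, fine_res):
                        fine.add(child)
    return fine


def build_hierarchy(base_res: int = 4,
                    num_levels: int = 5) -> List[Tuple[int, Set[Voxel]]]:
    """Build levels from coarse to fine; only active voxels are ever refined."""
    res = base_res
    # The coarsest level is small enough to evaluate densely.
    active = {(x, y, z)
              for x in range(res) for y in range(res) for z in range(res)
              if near_sphere_surface((x, y, z), res)}
    levels = [(res, active)]
    for _ in range(num_levels - 1):
        res *= 2
        active = subdivide(active, res, near_sphere_surface)
        levels.append((res, active))
    return levels


levels = build_hierarchy()
for res, active in levels:
    print(f"{res:>3}^3 grid: {len(active):>6} active of {res ** 3:>7} cells "
          f"({100 * len(active) / res ** 3:.1f}%)")
```

In this toy setup the active fraction shrinks as the resolution grows, because a surface covers on the order of `res**2` cells out of `res**3`. Because each level is derived only from the active voxels of the level before it, removing or adding voxels at a coarse level changes what is generated at finer levels, which is the intuition behind the multi-scale editing described in item 4.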