Abstract: Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
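The claim that cross-attention can connect tokens without an ordered sequence can be checked on a toy example. The sketch below is not the paper's cross-scale querying transformer: it is a bare single-head scaled dot-product cross-attention with no learned projections or positional embeddings, and the names `cross_attention`, `queries`, and `tokens` are hypothetical. It only illustrates that the output for each query does not change when the attended tokens are shuffled.

```python
import numpy as np

def cross_attention(queries, tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: (Q, d) array of query vectors.
    tokens:  (N, d) array used as both keys and values (an assumption
             made here to keep the sketch short).
    """
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)        # (Q, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the N tokens
    return weights @ tokens                         # (Q, d) weighted sum of values

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
tokens = rng.normal(size=(16, 8))
perm = rng.permutation(16)

out = cross_attention(queries, tokens)
out_shuffled = cross_attention(queries, tokens[perm])
print(np.allclose(out, out_shuffled))  # True
```

Because the softmax-weighted sum runs over the whole token set, reordering the tokens only reorders the terms of that sum, which is why no artificial ordering of the 3D tokens is needed on the attended side.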

What problem does this paper attempt to address?

This paper addresses the difficulty of extending autoregressive models to 3D generation tasks. Traditional autoregressive models learn token sequences through next-token prediction, which presupposes an ordered sequence and is therefore incompatible with the unordered nature of 3D data. As a result, these models perform poorly on 3D generation tasks.

To address this issue, the authors propose G3PT (Generative 3D Point Transformer), a scalable coarse-to-fine 3D generative model built on a Cross-scale Querying Transformer (CQT). The key innovation of G3PT is to map point-based 3D data into discrete tokens at different levels of detail, which naturally establishes a sequential relationship between levels that is suitable for autoregressive modeling. In addition, CQT connects tokens at different levels of detail globally through cross-attention layers, without requiring a specific token order.

Specifically, G3PT addresses the following problems:

1. **Unordered nature of 3D data**: Traditional autoregressive models rely on an ordered token sequence for prediction, but 3D data is unordered. G3PT avoids imposing an artificial order on 3D data by introducing a cross-scale querying mechanism.
2. **Multi-scale representation**: 3D data has a natural multi-level structure. By mapping 3D data into discrete tokens at different levels of detail, G3PT can capture and generate 3D structural information at different scales.
3. **Conditional generation**: G3PT supports diverse conditional structures, such as image-based and text-based inputs, so that the generated 3D shapes are consistent with the given conditions.
4. **High-quality generation**: The reported experiments show that G3PT achieves better generation quality and generalization ability than previous 3D generation methods, and that scaling it up reveals power-law scaling behavior for the first time in 3D generation.

### Formula summary

- **Quantization formula**:
  \[ \text{Index}(z)=\sum_{i = 1}^{\log_2 C}2^{i - 1}\,\mathbb{1}\{\zeta_i>0\} \]
  \[ \hat{z}=\hat{q}(\text{sign}(\zeta))=\hat{q}(\text{sign}(q(z))) \]
  Here \(\zeta = q(z)\) is the projected latent with components \(\zeta_i\), so each token is a binary code with \(\log_2 C\) components and an index between \(0\) and \(C - 1\).
- **Autoregressive modeling**:
  - **Next-token prediction**:
    \[ P(x)=\prod_{i = 1}^N P(x_i \mid x_1,x_2,\ldots,x_{i - 1}) \]
  - **Next-scale prediction**:
    \[ P(x)=\prod_{s = 1}^S P(x^{(s)} \mid x^{(1)},x^{(2)},\ldots,x^{(s - 1)}) \]
- **Cross-scale querying Transformer**:
  \[ Z = \text{CrossAttn}(\text{Lat},\text{PosEmb}(X)) \]
  \[ E_s=\text{CrossAttn}_{\text{down}}(e_s,Z_s) \]
  \[ \tilde{Z}_s=\text{CrossAttn}_{\text{up}}(\tilde{e}_s,\hat{E}_s) \]

Together, these components let G3PT address the ordering problem in autoregressive 3D generation while delivering strong results in 3D content creation.
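The sign-based quantization in the formula summary can be made concrete with a small numerical sketch. It assumes that \(\zeta = q(z)\) has already been computed and has \(\log_2 C\) components per token; the function name `lfq_quantize` is hypothetical, the back-projection \(\hat{q}\) is omitted, and mapping \(\zeta_i = 0\) to \(-1\) is a convention chosen here to match the indicator \(\{\zeta_i > 0\}\), not something the source states.

```python
import numpy as np

def lfq_quantize(zeta):
    """Sign-quantize projected latents and compute their integer index.

    zeta: array of shape (num_tokens, K), where K = log2(C) is the number
          of binary components per token (assumed to equal q(z)).
    Returns (codes, index): codes in {-1, +1} with shape (num_tokens, K),
    and index in [0, C) with shape (num_tokens,).
    """
    bits = zeta > 0                             # indicator 1{zeta_i > 0}
    codes = np.where(bits, 1.0, -1.0)           # sign(zeta); zero maps to -1 (assumption)
    weights = 2 ** np.arange(zeta.shape[-1])    # 2^(i-1) for i = 1..K
    index = (bits * weights).sum(axis=-1)       # Index(z) from the formula
    return codes, index

zeta = np.array([[0.3, -1.2, 0.7],
                 [-0.5, -0.1, -2.0],
                 [1.0, 2.0, 3.0]])
codes, index = lfq_quantize(zeta)
print(index)  # [5 0 7]
```

With K = 3 components the implied codebook has C = 8 entries, and no explicit codebook lookup is needed: the index is read directly off the signs.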

G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer

3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation

PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers

Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Pushing the Limits of 3D Shape Generation at Scale

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

PointGPT: Auto-regressively Generative Pre-training from Point Clouds

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

Atlas Gaussians Diffusion for 3D Generation

3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars

PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images

PV3D: A 3D Generative Model for Portrait Video Generation

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Deep Generative Models on 3D Representations: A Survey

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Turbo3D: Ultra-fast Text-to-3D Generation

iVideoGPT: Interactive VideoGPTs are Scalable World Models