G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer

Jinzhi Zhang, Feng Xiong, Mu Xu
2024-09-10
Abstract: Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of extending autoregressive models to 3D generation. Autoregressive models rely on next-token prediction over an ordered token sequence, which is incompatible with the unordered nature of 3D data, so they face significant challenges on 3D generation tasks. To address this, the authors propose G3PT (Generative 3D Point Transformer), a scalable coarse-to-fine 3D generative model built on a Cross-scale Querying Transformer (CQT). The key innovation of G3PT is to map point-based 3D data into discrete tokens at different levels of detail. This naturally establishes a sequential relationship between levels, which makes the data suitable for autoregressive modeling. In addition, CQT connects tokens at different levels of detail globally through cross-attention layers, without requiring an ordered sequence.

Specifically, G3PT addresses the following problems:

1. **Unordered nature of 3D data**: Autoregressive models predict over an ordered token sequence, but 3D data is unordered. G3PT avoids imposing an artificial order on 3D data by introducing a cross-scale querying mechanism.
2. **Multi-scale representation**: 3D data has a natural multi-level structure. By mapping 3D data into discrete tokens at different levels of detail, G3PT can capture and generate 3D structure at different scales.
3. **Conditional generation**: G3PT supports diverse conditional structures, such as image-based and text-based inputs, so that the generated 3D shapes are consistent with the given conditions.
4. **High-quality generation**: Experiments show that G3PT achieves better generation quality and generalization ability than previous 3D generation methods and, for the first time in 3D generation, reveals power-law scaling behavior as the model is scaled up.

### Formula summary

- **Quantization formula**, with $\zeta = q(z)$:
  \[ \text{Index}(z)=\sum_{i = 1}^{\log_2 C}2^{i - 1}\,\mathbb{1}\{\zeta_i>0\} \]
  \[ \hat{z}=\hat{q}(\text{sign}(\zeta))=\hat{q}(\text{sign}(q(z))) \]
- **Autoregressive modeling**:
  - **Next-token prediction**:
    \[ P(x)=\prod_{i = 1}^N P(x_i|x_1,x_2,\ldots,x_{i - 1}) \]
  - **Next-scale prediction**:
    \[ P(x)=\prod_{s = 1}^S P(x^{(s)}|x^{(1)},x^{(2)},\ldots,x^{(s - 1)}) \]
- **Cross-scale querying transformer**:
  \[ Z = \text{CrossAttn}(\text{Lat},\text{PosEmb}(X)) \]
  \[ E_s=\text{CrossAttn}_{\text{down}}(e_s,Z_s) \]
  \[ \tilde{Z}_s=\text{CrossAttn}_{\text{up}}(\tilde{e}_s,\hat{E}_s) \]

Together, these components let G3PT apply autoregressive modeling to unordered 3D data and, according to the authors' experiments, outperform previous 3D generation methods.
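
The index formula in the formula summary can be illustrated with a small sketch. It is not the authors' implementation: it assumes the projected latent ζ = q(z) is already available as a list of floats, skips the projections q and q̂ entirely, and uses hypothetical helper names (`lfq_index`, `sign_code`, `index_to_code`) and a hypothetical codebook size C = 16. Each channel of ζ contributes one bit, so a vector with log2(C) channels maps to an integer token in [0, C).

```python
import math


def lfq_index(zeta):
    """Index(z) = sum over i of 2^(i-1) * 1[zeta_i > 0], with i starting at 1."""
    # enumerate() is 0-based, so position i here corresponds to 2^(i-1) in the formula.
    return sum(1 << i for i, value in enumerate(zeta) if value > 0)


def sign_code(zeta):
    """Binarize each channel to +1 or -1.

    Assumption: a value of exactly 0 maps to -1, matching the strict
    inequality zeta_i > 0 used by the index formula.
    """
    return [1 if value > 0 else -1 for value in zeta]


def index_to_code(index, num_bits):
    """Inverse of lfq_index: recover the +/-1 code from an integer token."""
    return [1 if (index >> i) & 1 else -1 for i in range(num_bits)]


codebook_size = 16                          # C, a hypothetical value
num_bits = int(math.log2(codebook_size))    # log2(C) channels
zeta = [0.7, -1.2, 0.1, -0.3]               # hypothetical projected latent q(z)

index = lfq_index(zeta)                     # channels 1 and 3 are positive
code = sign_code(zeta)
print(index, code, index_to_code(index, num_bits))
```

Because the token index is read directly off the signs, no nearest-neighbor search over a codebook is needed in this sketch. The quantized latent ẑ would then be obtained by applying q̂ to `code`, which is omitted here.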