Abstract: Recent Diffusion Transformers (e.g., DiT) have proven highly effective at generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which runs the denoising process directly on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, DiT-3D scales better with model size and produces much higher-quality generations. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. Because the additional voxel dimension increases the 3D token length, and with it the computational cost of self-attention, we incorporate 3D window attention into the Transformer blocks. Finally, linear and devoxelization layers predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D: a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
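To make the motivation for 3D window attention concrete, the following is a minimal arithmetic sketch of how token count and attention cost grow when moving from 2D patches to 3D voxel patches. The grid resolution (32), patch size (4), and window size (4 tokens per side) are illustrative assumptions, not values taken from the abstract, and the function names are hypothetical helpers rather than part of any DiT-3D code.

```python
# Illustrative sketch only: all sizes and names are assumptions, not the paper's settings.

def num_tokens(resolution: int, patch: int, dims: int) -> int:
    """Count non-overlapping patches in a cubic grid with `dims` spatial dimensions."""
    if resolution % patch != 0:
        raise ValueError("resolution must be divisible by patch size")
    return (resolution // patch) ** dims


def global_attention_pairs(tokens: int) -> int:
    """Query-key pairs scored by full self-attention (quadratic in token count)."""
    return tokens * tokens


def window_attention_pairs(tokens: int, window_tokens: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    if tokens % window_tokens != 0:
        raise ValueError("token count must be divisible by window size")
    num_windows = tokens // window_tokens
    return num_windows * window_tokens * window_tokens


# Same resolution and patch size, one extra spatial dimension.
tokens_2d = num_tokens(32, 4, dims=2)
tokens_3d = num_tokens(32, 4, dims=3)

# Full attention over all 3D tokens versus attention inside 4x4x4-token windows.
full_pairs = global_attention_pairs(tokens_3d)
windowed_pairs = window_attention_pairs(tokens_3d, window_tokens=4 ** 3)

print(tokens_2d, tokens_3d, full_pairs, windowed_pairs)
```

Under these assumed sizes, the extra dimension multiplies the token count by the number of patches per side, and restricting attention to windows reduces the number of scored pairs by a factor equal to the number of windows.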

DiC: Rethinking Conv3x3 Designs in Diffusion Models

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Accelerating Vision Diffusion Transformers with Skip Branches

DiCENet: Dimension-wise Convolutions for Efficient Networks

Scalable Diffusion Models with State Space Backbone

Scalable Diffusion Models with Transformers

TerDiT: Ternary Diffusion Models with Transformers

Dynamic Diffusion Transformer

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Effective Diffusion Transformer Architecture for Image Super-Resolution

Scaling Diffusion Transformers to 16 Billion Parameters

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

DiTFastAttn: Attention Compression for Diffusion Transformer Models

DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures

LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up