Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel

2024-04-16

Abstract: Recently, diffusion transformers have gained wide attention for their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as the backbone for diffusion models. Transformer-based models have shown better generalization capability than CNN-based models on general vision tasks. However, the existing literature has explored much less about the capabilities of transformer-based diffusion backbones and about extending their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, so that diverse generative tasks can be completed using just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models in which we train a minimal number of parameters to adapt to different tasks. In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces, which may be absent in the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion models when fine-tuned on smaller datasets. We perform experiments on four unconditional image generation datasets. We show that, using our proposed method, a single pre-trained model can scale up to perform conditional and unconditional tasks with minimal parameter tuning while performing nearly as well as fine-tuning an entire diffusion model for that particular task.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper focuses on how to enhance the generative capability of Diffusion Transformers so that they extend efficiently to multiple datasets and tasks. Existing diffusion models usually require separate training for each dataset; this research instead proposes a strategy called DiffScaler that adapts to different tasks by training a small number of parameters. DiffScaler learns task-specific transformations at each layer, utilizes subspaces learned by the pre-trained model, and adds new task-specific subspaces that may be missing from the pre-training data. Because these parameters are independent, a single diffusion model equipped with them can perform multiple tasks simultaneously. DiffScaler thus allows a pre-trained model to handle various conditional and unconditional image generation tasks by fine-tuning a small number of parameters, achieving performance similar to that of a diffusion model fully fine-tuned for a particular task. Experiments show that Transformer-based diffusion models outperform CNN-based models when fine-tuned on small datasets, and that DiffScaler can be applied effectively to both unconditional and conditional image generation tasks, as well as to both convolutional and Transformer-based diffusion models. In addition, DiffScaler introduces a lightweight Affiner module that learns scaling factors for weights and biases and learns low-dimensional subspaces for new tasks. In this way, the model can adapt to new tasks by training less than 1% of the parameters while retaining pre-training information, resulting in a computationally efficient multi-task model.
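The Affiner idea described above can be illustrated with a small NumPy sketch. This is one plausible reading of the summary, not the paper's exact formulation: the class name `AffinerLinear`, the per-output-channel scaling, the `rank` parameter, and the zero initialization of the up-projection are all assumptions made for illustration. The frozen pre-trained weight and bias are rescaled by small learnable vectors (reusing pre-trained subspaces), and a low-rank path adds a new task-specific subspace.

```python
import numpy as np


class AffinerLinear:
    """Hypothetical adapter around one frozen linear layer (illustrative only)."""

    def __init__(self, weight, bias, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        # Frozen pre-trained parameters, shared by every task.
        self.weight = weight
        self.bias = bias
        # Task-specific scaling of weights and biases; ones reproduce the base layer.
        self.w_scale = np.ones(out_dim)
        self.b_scale = np.ones(out_dim)
        # Task-specific low-rank subspace; the zero up-projection makes it a no-op at start.
        self.down = rng.normal(0.0, 0.02, size=(rank, in_dim))
        self.up = np.zeros((out_dim, rank))

    def __call__(self, x):
        # Scaling the rows of W by w_scale equals scaling the outputs of W x.
        base = (x @ self.weight.T) * self.w_scale + self.bias * self.b_scale
        extra = (x @ self.down.T) @ self.up.T
        return base + extra

    def trainable_fraction(self):
        trainable = (self.w_scale.size + self.b_scale.size
                     + self.down.size + self.up.size)
        frozen = self.weight.size + self.bias.size
        return trainable / (trainable + frozen)


# Toy usage: a 1024x1024 layer with a rank-4 task subspace.
rng = np.random.default_rng(1)
W = rng.normal(size=(1024, 1024))
b = rng.normal(size=1024)
layer = AffinerLinear(W, b, rank=4)
x = rng.normal(size=(2, 1024))
y = layer(x)
print(y.shape, layer.trainable_fraction())
```

In this toy configuration the task-specific parameters make up under 1% of the layer's total, and at initialization the adapted layer reproduces the frozen layer exactly, so pre-training behaviour is the starting point. Supporting several tasks would amount to keeping one such set of small parameters per task alongside the single frozen weight matrix.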
