Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel
2024-04-16
Abstract: Recently, diffusion transformers have gained wide attention for their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as the backbone for diffusion models. Transformer-based models have shown better generalization capability compared to CNN-based models for general vision tasks. However, much less has been explored in the existing literature regarding the capabilities of transformer-based diffusion backbones and expanding their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, allowing for the completion of diverse generative tasks using just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models where we train a minimal number of parameters to adapt to different tasks. In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces, which may be absent in the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion models when fine-tuned on smaller datasets. We perform experiments on four unconditional image generation datasets. We show that using our proposed method, a single pre-trained model can scale up to perform these conditional and unconditional tasks with minimal parameter tuning while performing close to fine-tuning an entire diffusion model for that particular task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to extend the generative capability of Diffusion Transformers efficiently to multiple datasets and tasks. Existing diffusion models usually require separate training for each dataset; this research instead proposes a strategy called DiffScaler that adapts to different tasks by training a small number of parameters. DiffScaler learns task-specific transformations at each layer, reuses the subspaces already learned by the pre-trained model, and adds new task-specific subspaces that may be missing from the pre-training data. Because these parameters are independent, a single diffusion model equipped with them can perform multiple tasks simultaneously. This lets one pre-trained model handle various conditional and unconditional image generation tasks by fine-tuning only a small number of parameters, with performance close to that of a diffusion model fully fine-tuned for the particular task. Experiments show that Transformer-based diffusion models outperform CNN-based models when fine-tuned on small datasets, and that DiffScaler applies to both unconditional and conditional image generation tasks and to both convolutional and Transformer-based diffusion models. DiffScaler also introduces a lightweight Affiner module that learns scaling factors for the weights and biases and learns low-dimensional subspaces for new tasks, making the scaling more efficient. In this way, the model can adapt to new tasks with less than 1% of the parameters while retaining the pre-training information, resulting in a computationally efficient multi-task model.
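To make the Affiner idea more concrete, here is a minimal NumPy sketch of one way a frozen linear layer could be wrapped with learnable scales on its weights and bias plus a low-rank branch for a new task-specific subspace. It is an illustration under stated assumptions, not the authors' implementation: the class name `AffinerLinearSketch`, the per-output-channel scaling, the rank, the initialization, and the layer sizes are all hypothetical choices that the summary above does not specify.

```python
import numpy as np


class AffinerLinearSketch:
    """Hypothetical adapter around one frozen linear layer (not the authors' code)."""

    def __init__(self, weight, bias, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        # Pre-trained parameters: kept frozen and shared across all tasks.
        self.weight = weight
        self.bias = bias
        # Task-specific scales (assumed per output channel), initialized to 1.
        self.weight_scale = np.ones(d_out)
        self.bias_scale = np.ones(d_out)
        # Task-specific low-rank branch (assumed form). The up-projection
        # starts at zero, so the branch contributes nothing at initialization.
        self.down = rng.normal(0.0, 0.02, size=(rank, d_in))
        self.up = np.zeros((d_out, rank))

    def __call__(self, x):
        # x has shape (batch, d_in).
        frozen = x @ self.weight.T
        # Rescale the frozen weight output and the frozen bias.
        scaled = frozen * self.weight_scale + self.bias * self.bias_scale
        # Add the contribution of the newly learned low-dimensional subspace.
        new_subspace = (x @ self.down.T) @ self.up.T
        return scaled + new_subspace

    def task_param_count(self):
        return (self.weight_scale.size + self.bias_scale.size
                + self.down.size + self.up.size)

    def frozen_param_count(self):
        return self.weight.size + self.bias.size


# Illustrative sizes only; they are not taken from the paper.
rng = np.random.default_rng(1)
W = rng.normal(size=(1024, 1024))
b = rng.normal(size=1024)
layer = AffinerLinearSketch(W, b, rank=4)
x = rng.normal(size=(2, 1024))

y = layer(x)
# Share of task-specific parameters relative to the frozen layer.
ratio = layer.task_param_count() / layer.frozen_param_count()
```

With these illustrative sizes the task-specific parameters come to 10,240 against 1,049,600 frozen ones, which is just under 1% for this single layer. At initialization the wrapped layer reproduces the frozen layer exactly, so training starts from the pre-trained behavior. Because each task keeps its own scales and low-rank matrices, several tasks could share the same frozen weights, which is the property the summary attributes to DiffScaler.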