Abstract: Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; meanwhile, the abandonment of U-Net by DiTs and their subsequent improvements is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-architectured DiT with an isotropic one. It turns out that the U-Net architecture gains only a slight advantage despite the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. Code is available at https://github.com/YuchuanTian/U-DiT.
What problem does this paper attempt to address?
This paper asks how to combine the advantages of the U-Net architecture and the Transformer architecture in diffusion models, so as to improve both the performance and the computational efficiency of image generation. Specifically, the authors observe that conventional Diffusion Transformers (DiTs) adopt an isotropic architecture, that is, a plain stack of Transformer blocks, and abandon the widely used U-Net architecture, even though U-Net has potential advantages in denoising tasks. The authors therefore reconsider applying the U-Net architecture to Diffusion Transformers and test the idea through a series of experiments.
### Main problems and solutions
1. **The potential of the U-Net architecture in Diffusion Transformers**:
- **Problem**: Conventional DiTs abandon the U-Net architecture in favor of an isotropic one, although U-Net has distinct advantages in denoising tasks.
- **Solution**: The authors first tried a simple U-Net-style DiT (called DiT-UNet) and compared it with the isotropic DiT. At similar computational cost, DiT-UNet showed only a limited advantage, indicating that the inductive bias of U-Net was not fully utilized.
2. **Optimization of low-frequency features and self-attention mechanism**:
- **Problem**: The features of the U-Net backbone are dominated by low-frequency components, and high-frequency information is mostly noise. Directly applying full-scale self-attention may therefore introduce redundancy.
- **Solution**: The authors proposed downsampling the query, key, and value (QKV) to reduce computation and emphasize low-frequency information. This downsampled self-attention mechanism not only improves performance but also significantly reduces computational cost.
3. **Scaling up the model to verify the effect**:
- **Problem**: The proposed U-DiT model needs to be validated at larger scales.
- **Solution**: Based on the above findings, the authors designed a series of U-DiT models of different scales and verified their superior performance on ImageNet 256×256 and 512×512 image generation through a large number of experiments.
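The downsampled self-attention from item 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the four 2× downsampled tuples are obtained by plain strided sampling of the feature map, uses a single head, and follows the unscaled `Softmax(QK^T)` form. The function names and the projection matrices `wq`, `wk`, `wv` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head self-attention on (tokens, d) arrays: X = Softmax(Q K^T) V.
    return softmax(q @ k.T) @ v


def downsampled_self_attention(x, wq, wk, wv):
    """Attention over four 2x-downsampled token grids (illustrative only).

    x          : (N, N, d) feature map, N even
    wq, wk, wv : (d, d) projection matrices (hypothetical names)
    Assumption : each downsampled grid is a strided sub-grid x[i::2, j::2].
    """
    n, _, d = x.shape
    out = np.empty_like(x)
    for i in (0, 1):
        for j in (0, 1):
            # Flatten one sub-grid into (N/2)^2 tokens of dimension d.
            sub = x[i::2, j::2].reshape(-1, d)
            y = attention(sub @ wq, sub @ wk, sub @ wv)
            # Write the result back to the positions it was sampled from.
            out[i::2, j::2] = y.reshape(n // 2, n // 2, d)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 8, 16))
    w = [0.1 * rng.standard_normal((16, 16)) for _ in range(3)]
    print(downsampled_self_attention(feat, *w).shape)
```

Each sub-grid attends only within its own (N/2)² tokens, which is where the reduction in attention cost comes from. The output keeps the input's spatial size.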
### Experimental results
Through extensive experiments, the authors show that the U-DiT model is superior to existing DiT models in several respects:
- **Performance improvement**: The U-DiT model significantly outperforms the DiT model on the Fréchet Inception Distance (FID) metric, especially at lower computational cost.
- **Computational efficiency**: The U-DiT model saves about 1/3 of the computation through the downsampled self-attention mechanism.
- **Scalability**: The U-DiT model still maintains excellent performance on larger-scale datasets and with longer training.
### Formula explanation
The formulas in the paper include a complexity analysis of the self-attention mechanism.
Given an input feature map of size \(N\times N\) with dimension \(d\), let \(Q, K, V\in\mathbb{R}^{N^{2}\times d}\) be the mapped query, key, and value. Standard self-attention is computed as:
\[ X = A V, \quad \text{where} \quad A=\text{Softmax}(QK^{T}) \]
In the downsampled self-attention mechanism, self-attention is performed separately on each of four downsampled query-key-value tuples:
\[ 4\times\left(Q_{\downarrow 2}, K_{\downarrow 2}, V_{\downarrow 2}\right), \quad Q_{\downarrow 2}, K_{\downarrow 2}, V_{\downarrow 2}\in\mathbb{R}^{\left(\frac{N}{2}\right)^{2}\times d} \]
The cost of each self-attention operation is only 1/16 of that of full-scale self-attention, so the total cost of the four operations is 1/4 of the full-scale cost, a saving of 3/4 of the attention computation.
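The 1/16 and 1/4 factors can be checked with a short count of multiply-accumulate operations. This is a sketch that counts only the two matrix products \(QK^{T}\) and \(AV\) and ignores the softmax and the QKV projections. The symbols \(C_{\text{full}}\) and \(C_{\downarrow 2}\) are introduced here for illustration and are not taken from the paper.

\[ C_{\text{full}} = \underbrace{(N^{2})^{2}\,d}_{QK^{T}} + \underbrace{(N^{2})^{2}\,d}_{AV} = 2N^{4}d \]

\[ C_{\downarrow 2} = 2\left(\frac{N^{2}}{4}\right)^{2} d = \frac{2N^{4}d}{16} = \frac{1}{16}\,C_{\text{full}}, \qquad 4\,C_{\downarrow 2} = \frac{1}{4}\,C_{\text{full}} \]

The factor 1/16 arises because the attention cost is quadratic in the number of tokens, and each downsampled tuple has one quarter of the tokens.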
In summary, this paper combines the advantages of the U-Net architecture and the Transformer architecture to propose...