Abstract: Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.
As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
### The problems the paper attempts to solve
The paper "Analyzing and Improving the Training Dynamics of Diffusion Models" addresses uneven and ineffective training in diffusion models. Specifically, the authors identify and correct several causes of this behavior in the popular ADM (Ablated Diffusion Model) architecture, without changing the high-level structure of the model.
### Main problems and solutions
1. **Uncontrolled changes in network activations and weights**:
- **Problem**: Over the course of training, the magnitudes of the network activations and weights change in an uncontrolled way and become imbalanced.
- **Solution**: Redesign the network layers so that activation, weight, and update magnitudes are preserved in expectation. Applying this idea systematically eliminates the observed drifts and imbalances.
2. **Imbalance in gradient feedback**:
- **Problem**: The magnitude of the gradient feedback varies between noise levels, which re-weights their relative contributions in an uncontrolled way.
- **Solution**: Adopt a continuous generalization of the multi-task loss proposed by Kendall et al.: track the raw loss value at each noise level and scale the training loss by its reciprocal.
3. **Architecture simplification and stabilization**:
- **Problem**: The original architecture contains multiple types of trainable parameters, increasing the complexity of analyzing training dynamics.
- **Solution**: Remove the additive biases in convolutional and linear layers, uniformly initialize all weights, simplify the normalization layer, and use the cosine attention mechanism to prevent the attention map from becoming fragile and sharp.
4. **Normalizing activation magnitudes**:
- **Problem**: Despite the use of group normalization, the activation magnitudes still grow uncontrollably during the training process.
- **Solution**: Make the learned layers magnitude-preserving: divide the output of each layer by the expected scaling of activation magnitudes caused by that layer, so that activation magnitudes stay constant in expectation.
5. **Normalizing weights and updates**:
- **Problem**: Even after normalizing network activations, network weights still tend to grow.
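- **Solution**: Use forced weight normalization: explicitly normalize each weight vector to unit variance before each training step, and still apply "standard" weight normalization when the weights are used. This projects the training gradients onto the tangent plane of the hypersphere on which the weights lie, so that Adam's variance estimates are computed for the actual tangent-plane steps.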
6. **Removing group normalization layers**:
- **Problem**: Group normalization layers may have an adverse effect on pixel operations.
- **Solution**: Remove all group normalization layers, introduce a weaker pixel normalization layer, remove the second linear layer in the embedding network and the nonlinear part of the network output, and merge the resampling operations in the residual blocks.
7. **Maintaining the magnitudes of fixed-function layers**:
- **Problem**: There are still layers in the network that do not maintain the activation magnitudes.
- **Solution**: Adjust the scales of the sine and cosine functions of the Fourier features, rescale the output of the SiLU nonlinearity, and adjust the weights of the U-Net skip connections so that the inputs contribute equally.
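
Items 4, 5, and 7 above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`forced_normalize`, `mp_linear`, `mp_silu`), the `eps` value, and the layer sizes are assumptions made for illustration, and the constant 0.596 is the approximate root-mean-square value of SiLU applied to unit-variance Gaussian input.

```python
import numpy as np

SILU_RMS = 0.596  # approximate RMS of silu(x) for x ~ N(0, 1)

def forced_normalize(w, eps=1e-4):
    """Rescale every output-channel weight vector (row of w) to unit variance."""
    flat = w.reshape(w.shape[0], -1)
    fan_in = flat.shape[1]
    norm = np.linalg.norm(flat, axis=1, keepdims=True)
    # A vector whose elements have unit variance has norm sqrt(fan_in).
    return (flat / (eps + norm / np.sqrt(fan_in))).reshape(w.shape)

def mp_linear(x, w):
    """Magnitude-preserving linear layer: weights are normalized again on use."""
    fan_in = w.shape[1]
    w_unit = forced_normalize(w) / np.sqrt(fan_in)  # rows now have unit norm
    return x @ w_unit.T

def mp_silu(x):
    """SiLU rescaled so unit-magnitude input gives roughly unit-magnitude output."""
    return x / (1.0 + np.exp(-x)) / SILU_RMS

rng = np.random.default_rng(0)
w = 5.0 * rng.normal(size=(32, 64))   # deliberately mis-scaled weights
x = rng.normal(size=(10_000, 64))     # unit-variance activations

w = forced_normalize(w)               # done before each training step
y = mp_linear(x, w)                   # normalization on use
print("row norms:", np.linalg.norm(w, axis=1)[:3])
print("output std:", y.std())
print("mp_silu RMS:", np.sqrt(np.mean(mp_silu(x) ** 2)))
```

Because the stored weights are rescaled before every step and normalized again on use, the output magnitude does not depend on how large the raw weights have grown. This is the property the redesign aims for.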
### Experimental results
With these changes, the authors improve model quality considerably. In ImageNet-512 synthesis, the previous record FID (Fréchet Inception Distance) of 2.41 is improved to 1.81 at equal computational complexity, using fast deterministic sampling. As an independent contribution, they present a method for setting the exponential moving average (EMA) parameters after training is complete, which allows precise tuning of the EMA length without performing several training runs.
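
The idea behind post-hoc EMA can be sketched as follows. An EMA of the weights is a fixed linear combination of the weights seen at each training step, so a new averaging profile can be approximated as a least-squares combination of snapshots that were stored during training. The sketch below is a toy illustration under stated assumptions, not the paper's procedure: the power-function decay `beta = (1 - 1/t) ** (gamma + 1)` is used here as an assumed averaging profile, the `gamma` values and snapshot times are arbitrary, and `ema_profile` and `posthoc_coefficients` are hypothetical helper names.

```python
import numpy as np

def ema_profile(total_steps, t_end, gamma):
    """Weight that an EMA evaluated at step t_end places on each training step."""
    p = np.zeros(total_steps)
    for t in range(1, t_end + 1):
        beta = (1.0 - 1.0 / t) ** (gamma + 1.0)  # assumed power-function decay
        p *= beta              # decay the weight on all earlier steps
        p[t - 1] += 1.0 - beta # add the weight on the current step
    return p

def posthoc_coefficients(snapshot_profiles, target_profile):
    """Least-squares mix of stored profiles that best matches the target profile."""
    A = np.stack(snapshot_profiles, axis=1)  # shape (steps, num_snapshots)
    coeffs, *_ = np.linalg.lstsq(A, target_profile, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
T = 200
trajectory = np.cumsum(rng.normal(size=(T, 3)), axis=0)  # toy "weights" per step

# Snapshots stored during training: two EMA lengths at four points in time.
settings = [(t_end, g) for t_end in (50, 100, 150, 200) for g in (7.0, 17.0)]
profiles = [ema_profile(T, t_end, g) for t_end, g in settings]
snapshots = [p @ trajectory for p in profiles]

# After training: approximate an EMA length that was never tracked.
target = ema_profile(T, T, 10.0)
coeffs = posthoc_coefficients(profiles, target)
reconstructed = sum(c * s for c, s in zip(coeffs, snapshots))
exact = target @ trajectory
print("max abs error:", np.abs(reconstructed - exact).max())
```

The reconstruction is exact when the target profile lies in the span of the stored profiles, and approximate otherwise. Storing more snapshots, or more EMA lengths per snapshot, generally tightens the fit.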
### Summary
By systematically analyzing the training dynamics of diffusion models and redesigning the network layers to preserve activation, weight, and update magnitudes, the paper removes the observed drifts and imbalances and obtains considerably better networks at equal computational complexity. The post-hoc EMA method additionally makes the EMA length tunable without retraining.