Abstract: In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.
Machine Learning, Computer Vision and Pattern Recognition, Image and Video Processing, Signal Processing
### Problems the paper attempts to solve
This paper studies the generalization ability of diffusion models by analyzing hidden properties of the score functions they learn. Diffusion models are a class of generative models widely used for tasks such as image generation. Although they perform well in practice, the mechanism behind their generalization is not fully understood. The paper's main contribution is to show that diffusion models tend to learn the Gaussian structure of the training data when they generalize, and to examine how this tendency depends on model capacity and training time.
### Main findings
1. **Inductive bias of Gaussian structure**:
- Diffusion models show an inductive bias towards Gaussian structure when they generalize. Specifically, as a model transitions from the memorization regime to the generalization regime, its nonlinear denoisers become increasingly linear.
- Linear models trained to match these nonlinear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution defined by the empirical mean and covariance of the training data.
2. **Influence of model capacity and training time**:
- The inductive bias towards Gaussian structure is most pronounced when the model capacity is small relative to the size of the training dataset.
- Even when the model is highly overparameterized, the bias still appears early in training, before the model fully memorizes the training data. This suggests that early stopping can improve the generalization of overparameterized models.
3. **Connection between strong generalization and Gaussian structure**:
- The paper argues that the strong generalization recently observed in diffusion models stems from the models learning low-dimensional structure shared across different datasets, and that the Gaussian structure partially explains this low-dimensional structure.
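The Gaussian denoiser mentioned in the findings above has a well-known closed form: for data modeled as N(μ, Σ) and an input y = x + σε, the posterior mean is μ + Σ(Σ + σ²I)⁻¹(y − μ). The NumPy sketch below is a minimal illustration of that formula, not the paper's code. The function name `gaussian_denoiser`, the toy data, and the noise level 0.5 are all assumptions made for the example.

```python
import numpy as np


def gaussian_denoiser(y, mean, cov, sigma):
    """Posterior mean E[x | y] for x ~ N(mean, cov) and y = x + sigma * noise.

    y has shape (n, d); mean has shape (d,); cov has shape (d, d); sigma > 0.
    """
    # cov is symmetric, so eigh gives cov = U @ diag(lam) @ U.T.
    lam, U = np.linalg.eigh(cov)
    lam = np.clip(lam, 0.0, None)  # remove tiny negative values from round-off
    # Each principal direction is shrunk by lam_i / (lam_i + sigma^2).
    shrink = lam / (lam + sigma**2)
    coeffs = (y - mean) @ U
    return mean + (coeffs * shrink) @ U.T


# Toy stand-in for a training set (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
mu = train.mean(axis=0)                  # empirical mean
cov_hat = np.cov(train, rowvar=False)    # empirical covariance

noisy = train[:4] + 0.5 * rng.normal(size=(4, 8))
denoised = gaussian_denoiser(noisy, mu, cov_hat, sigma=0.5)
```

Working in the eigenbasis of the covariance makes the behavior easy to read: directions with variance much larger than σ² pass through almost unchanged, while directions with variance much smaller than σ² are pulled towards the mean.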
### Research methods
- **Linear distillation technique**: A series of linear models is trained to approximate the nonlinear diffusion denoisers, which exposes the linear structure hidden inside them.
- **Experimental verification**: The linearity of the denoisers and the approximation error of the score field are quantified to measure how closely the linear models match the nonlinear models.
- **Theoretical analysis**: The paper proves that, when the denoising score-matching objective is minimized under a linear constraint, the optimal solution is the Gaussian denoiser.
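The theoretical claim in the last item can be checked with a short least-squares argument. The derivation below is a sketch of the standard calculation, not a reproduction of the paper's proof. The notation is assumed for illustration: an affine denoiser D(y) = Wy + b, a training sample x with empirical mean μ and covariance Σ, noise ε ~ N(0, I) independent of x, and noise level σ > 0.

```latex
% Assumed notation: D(y) = W y + b, y = x + \sigma\epsilon,
% \mu and \Sigma are the empirical mean and covariance of the training data.
\begin{aligned}
\min_{W,\,b}\; & \mathbb{E}_{x,\epsilon}\,\bigl\| W(x + \sigma\epsilon) + b - x \bigr\|_2^2 \\
\partial_b = 0 \;\Rightarrow\; & b = (I - W)\,\mu \\
\partial_W = 0 \;\Rightarrow\; & W(\Sigma + \sigma^2 I) = \Sigma
  \;\Rightarrow\; W = \Sigma\,(\Sigma + \sigma^2 I)^{-1} \\
\Rightarrow\; & D(y) = \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}(y - \mu)
\end{aligned}
```

The second condition uses E[(y − μ)(y − μ)ᵀ] = Σ + σ²I and E[(x − μ)(y − μ)ᵀ] = Σ. The resulting D(y) is exactly the posterior mean of a Gaussian N(μ, Σ) observed under noise of level σ, which is why the optimum depends on the training data only through its mean and covariance.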
### Experimental results
- **High-noise region**: For σ(t) ∈ [20, 80], the linear and Gaussian models are extremely similar to each other and approximate the nonlinear models well.
- **Low-noise region**: For σ(t) ∈ [0.002, 0.1], the linear and Gaussian models can still effectively approximate the nonlinear models.
- **Intermediate-noise region**: For σ(t) ∈ [0.1, 20], the nonlinear model is markedly nonlinear, yet the linear and Gaussian models still maintain a low approximation error.
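One way to quantify how linear a denoiser is at a given noise level is to compare its output on a combination of two inputs with the same combination of its outputs. The sketch below uses cosine similarity for that comparison. The function name `linearity_score`, the default weights 1/√2, and the use of cosine similarity are assumptions for illustration and may differ from the metric the paper actually uses.

```python
import numpy as np


def linearity_score(denoiser, x1, x2, alpha=1 / np.sqrt(2), beta=1 / np.sqrt(2)):
    """Mean cosine similarity between D(a*x1 + b*x2) and a*D(x1) + b*D(x2).

    denoiser maps an (n, d) array to an (n, d) array. A score of 1 means the
    map is additive and homogeneous on these inputs; lower means less linear.
    """
    combined_in = denoiser(alpha * x1 + beta * x2)
    combined_out = alpha * denoiser(x1) + beta * denoiser(x2)
    dots = np.sum(combined_in * combined_out, axis=-1)
    norms = np.linalg.norm(combined_in, axis=-1) * np.linalg.norm(combined_out, axis=-1)
    return float(np.mean(dots / norms))


# Hypothetical stand-ins for denoisers at one fixed noise level.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
x1 = 3.0 * rng.normal(size=(100, 16))
x2 = 3.0 * rng.normal(size=(100, 16))

score_linear = linearity_score(lambda x: x @ W, x1, x2)   # a purely linear map
score_nonlinear = linearity_score(np.tanh, x1, x2)        # a saturating nonlinearity
```

In practice, `denoiser` would wrap a trained network evaluated at a fixed σ(t), and the score would be computed separately for each noise level to compare the three regions listed above.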
### Conclusion
By analyzing how diffusion models behave as they generalize, this paper reveals their inductive bias towards Gaussian structure. The finding helps explain the strong generalization ability of diffusion models and offers a new perspective on model design and training strategies.