Variance reduction of diffusion model's gradients with Taylor approximation-based control variate

Paul Jeha, Will Grathwohl, Michael Riis Andersen, Carl Henrik Ek, Jes Frellsen
2024-08-22
Abstract: Score-based models, trained with denoising score matching, are remarkably effective in generating high dimensional data. However, the high variance of their training objective hinders optimisation. We attempt to reduce it with a control variate, derived via a $k$-th order Taylor expansion on the training objective and its gradient. We prove an equivalence between the two and demonstrate empirically the effectiveness of our approach on a low dimensional problem setting; and study its effect on larger problems.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the high variance of the gradients used to train diffusion models. The authors focus on score-based models trained with denoising score matching (DSM), whose training objective has high variance, which makes optimisation difficult. To reduce the variance of the training objective and of its gradient, they propose a control variate based on a Taylor expansion.

Specific contributions include:

1. **Derivation of control variates for Taylor polynomials of arbitrary order**: The authors propose a general framework that derives control variates from Taylor polynomials of arbitrary order, for both the training objective and its gradient.
2. **Proof of the equivalence of controlling the training objective and its gradient**: The authors prove that controlling the training objective is equivalent to controlling its gradient, which provides a theoretical basis for future research.
3. **Empirical importance of regression coefficients**: The authors show through experiments that the regression coefficient matters for the effectiveness of the control variate.
4. **Validation of effectiveness in low-dimensional problem settings**: The authors verify the effectiveness of the proposed method empirically on low-dimensional problems.
5. **Study of the impact in high-dimensional problems**: The authors explore the impact of control variates on high-dimensional problems and point out their limitations.
6. **Limitations of control variates based on Taylor expansion**: The authors show the limitations of the Taylor expansion when dealing with complex networks, especially for large noise values ($\sigma$).
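The control-variate idea can be illustrated with a minimal NumPy sketch of the zeroth-order ($k = 0$) case. It assumes the DSM objective $\frac{1}{2}\|z/\sigma + s_\theta(x + \sigma z)\|^2$ with $z \sim \mathcal{N}(0, I_D)$. Replacing $s_\theta(x + \sigma z)$ by $s_\theta(x)$ and subtracting the expectation over $z$ leaves the zero-mean quantity $(\|z\|^2 - D)/(2\sigma^2) + z^T s_\theta(x)/\sigma$. The `score` function below is a hypothetical stand-in for a trained network, and estimating the regression coefficient from the same samples is an assumption of this sketch, not necessarily the authors' procedure.

```python
import numpy as np


def score(x):
    # Hypothetical toy score function standing in for s_theta (not from the paper).
    return -np.tanh(x)


def dsm_loss(z, x, sigma):
    # Per-sample DSM objective: 0.5 * || z / sigma + s(x + sigma * z) ||^2.
    return 0.5 * np.sum((z / sigma + score(x + sigma * z)) ** 2, axis=-1)


def control_variate_k0(z, x, sigma):
    # Zeroth-order Taylor control variate; it has zero mean under z ~ N(0, I).
    d = z.shape[-1]
    s0 = score(x)
    quadratic = (np.sum(z**2, axis=-1) - d) / (2.0 * sigma**2)
    linear = np.sum(z * s0, axis=-1) / sigma
    return quadratic + linear


def controlled_loss(z, x, sigma):
    # Subtract beta * C, with beta = Cov(L, C) / Var(C) estimated in-sample.
    loss = dsm_loss(z, x, sigma)
    cv = control_variate_k0(z, x, sigma)
    cov = np.cov(loss, cv)  # 2x2 sample covariance matrix
    beta = cov[0, 1] / cov[1, 1]
    return loss - beta * cv, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([0.3, -1.2])              # a single data point, D = 2
    z = rng.standard_normal((100_000, 2))  # noise samples
    for sigma in (0.1, 0.5, 2.0):
        plain = dsm_loss(z, x, sigma)
        controlled, beta = controlled_loss(z, x, sigma)
        print(sigma, beta, plain.var(), controlled.var())
```

Because the control variate has zero mean, subtracting it leaves the expected loss unchanged up to a small bias from estimating the coefficient on the same samples. Comparing the printed variances across values of `sigma` gives a rough feel for how the usefulness of a low-order expansion depends on the noise level.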
### Formula summary

- **Training objective function**:
\[ L_\theta(z, x, \sigma)=\frac{1}{2}\left\|\frac{z}{\sigma}+s_\theta(x + \sigma z)\right\|^2 \]
- **Control variate**:
\[ C^k_\theta(z, x, \sigma)=\frac{\|z\|^2 - D}{2\sigma^2}+\frac{1}{2}\sum_{|\alpha|\leq k}\sum_{|\rho|\leq k}\frac{\sigma^{|\alpha|+|\rho|}}{\alpha!\rho!}\left(z^{\alpha+\rho}-\delta_{\alpha+\rho}\right)\partial^\alpha s_\theta(x)^T\partial^\rho s_\theta(x)+\sum_{|\alpha|\leq k}\frac{\sigma^{|\alpha|-1}}{\alpha!}\left(z^\alpha z - \mathbb{E}[z^\alpha z]\right)^T\partial^\alpha s_\theta(x) \]
- **Control variate for controlling the gradient**:
\[ C^k_{g,\theta}(z, x, \sigma)=\sum_{|\rho|\leq k}\frac{\sigma^{|\rho|-1}}{\rho!}\left(z^\rho z - \mathbb{E}[z^\rho z]\right)^T\partial^\rho\partial_\theta s_\theta(x)+\sum_{|\rho|\leq k}\sum_{|\alpha|\leq k}\frac{\sigma^{|\alpha|+|\rho|}}{\alpha!\rho!}\left(z^{\alpha+\rho}\right)