Abstract: Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results along these lines for one of the most fundamental distribution families, Gaussian mixture models. We prove that gradient descent on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model in the following two settings: 1) We show gradient descent with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\text{poly}(d)$-separated centers. 2) We show gradient descent with a warm start learns mixtures of $K$ spherical Gaussians with $\Omega(\sqrt{\log(\min(K,d))})$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, the EM algorithm and spectral methods.

What problem does this paper attempt to address?

This paper asks whether, for Gaussian Mixture Models (GMMs), accurate score estimation can be achieved by running gradient descent on the Denoising Diffusion Probabilistic Model (DDPM) objective. Specifically, the authors aim to prove that, under certain conditions, gradient descent on the DDPM objective efficiently learns the true parameters of the mixture.

### Problem Background

In recent years, diffusion models have received extensive attention as a powerful generative modeling framework. Their core idea is to learn a distribution through denoising, or equivalently score estimation, where the score is the gradient of the log-density of the data distribution. The DDPM objective is a commonly used score-matching objective, optimized by minimizing the difference between the predicted noise and the actual noise.

Much theoretical work has proven the effectiveness of diffusion models under certain assumptions, but most of it relies on the existence of an "oracle" for score estimation and does not show how to obtain provably accurate score estimates for interesting distribution families such as Gaussian Mixture Models. A key question is therefore: **Are there natural data distributions under which gradient descent can be proven to achieve accurate score estimation?**

### Research Contributions

The authors focus on Gaussian Mixture Models and prove two main results:

1. **Theorem 1 (informal statement)**: For a mixture of two spherical Gaussians whose centers are separated by at least $\frac{1}{\text{poly}(d)}$, gradient descent on the DDPM objective, started from a random initialization, efficiently learns the true parameters of the model.
2. **Theorem 2 (informal statement)**: For a mixture of $K$ spherical Gaussians whose centers are separated by $\Omega(\sqrt{\log(\min(K,d))})$, gradient descent on the DDPM objective, started from an initialization close to the true centers, efficiently learns the true parameters of the model.

### Technical Overview

To prove these results, the authors relate the behavior of gradient descent at different noise levels to two classic algorithms, power iteration and the Expectation-Maximization (EM) algorithm:

- **Large noise level**: Gradient descent behaves like power iteration, which helps find a solution pointing in the same direction as the true parameters.
- **Small noise level**: Gradient descent behaves like the M-step update of the EM algorithm, so it converges quickly to the true parameters.

The authors also discuss how to handle smaller separation distances and how to extend the analysis to the general case of $K$ Gaussians.

### Conclusion

The main contribution of this paper is the first provably efficient result for optimizing the DDPM objective by gradient descent in the setting of Gaussian Mixture Models. This deepens the understanding of diffusion models and provides new perspectives and tools for score estimation in practical applications.
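The two classical algorithms named in the Technical Overview can be illustrated on the simplest case, a symmetric mixture $\frac{1}{2}N(\mu, I) + \frac{1}{2}N(-\mu, I)$. The sketch below is not the paper's method: it runs plain power iteration and plain EM directly, rather than gradient descent on the DDPM objective, to show what each stage contributes. The dimension, sample size, value of `mu`, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def sample_mixture(mu, n, rng):
    """Draw n samples from 0.5*N(mu, I) + 0.5*N(-mu, I)."""
    signs = rng.choice([-1.0, 1.0], size=(n, 1))
    return signs * mu + rng.standard_normal((n, mu.shape[0]))


def power_iteration(x, theta, steps=50):
    """Power iteration on the empirical second-moment matrix of x.

    For this mixture E[x x^T] = I + mu mu^T, so the top eigenvector
    is parallel to mu. Only the direction is recovered, not the norm.
    """
    m = x.T @ x / x.shape[0]
    for _ in range(steps):
        theta = m @ theta
        theta /= np.linalg.norm(theta)
    return theta


def em_updates(x, theta, steps=50):
    """EM for the symmetric two-component mixture.

    The E-step posterior weights reduce to tanh(<theta, x>), and the
    M-step is the weighted average theta <- mean(tanh(<theta, x>) * x).
    """
    for _ in range(steps):
        theta = np.mean(np.tanh(x @ theta)[:, None] * x, axis=0)
    return theta


# Illustrative setup (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
d = 5
mu = np.zeros(d)
mu[0] = 2.0
x = sample_mixture(mu, 20000, rng)

# Stage 1: random unit-norm start, power iteration finds the direction.
theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)
direction = power_iteration(x, theta0)

# Stage 2: EM refines the direction into an estimate of mu (up to sign).
estimate = em_updates(x, direction)

cosine = abs(direction @ mu) / np.linalg.norm(mu)
error = min(np.linalg.norm(estimate - mu), np.linalg.norm(estimate + mu))
print(f"|cos(direction, mu)| = {cosine:.3f}")
print(f"distance to +/-mu    = {error:.3f}")
```

The mixture is invariant under $\mu \to -\mu$, so the estimate is compared against both signs. Splitting the work into a direction-finding stage and a refinement stage mirrors the large-noise and small-noise regimes described in the overview.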

Learning Mixtures of Gaussians Using the DDPM Objective

Gaussian mixture density modeling and decomposition with weighted likelihood

Learning Mixtures of Gaussians Using Diffusion Models

Learning general Gaussian mixtures with efficient score matching

Mix-DDPM: Enhancing Diffusion Models Through Fitting Mixture Noise with Global Stochastic Offset

Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient Flow

Sequential Learning for Dirichlet Process Mixtures

Learning Mixtures of Arbitrary Distributions over Large Discrete Domains.

Diffusion Model Conditioning on Gaussian Mixture Model and Negative Gaussian Mixture Gradient

Fast Deep Mixtures of Gaussian Process Experts

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Privately Learning Mixtures of Axis-Aligned Gaussians

Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models

Learning Probability Density Functions from Marginal Distributions with Applications to Gaussian Mixtures

Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

A Unified Perspective on Natural Gradient Variational Inference with Gaussian Mixture Models

Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

Sample-Efficient Private Learning of Mixtures of Gaussians