Abstract: Motivated by the computation of the non-parametric maximum likelihood estimator (NPMLE) and the Bayesian posterior in statistics, this paper explores the problem of convex optimization over the space of all probability distributions. We introduce an implicit scheme, called the implicit KL proximal descent (IKLPD) algorithm, for discretizing a continuous-time gradient flow relative to the Kullback-Leibler divergence for minimizing a convex target functional. We show that IKLPD converges to a global optimum at a polynomial rate from any initialization; moreover, if the objective functional is strongly convex relative to the KL divergence, for example, when the target functional itself is a KL divergence as in the context of Bayesian posterior computation, IKLPD exhibits globally exponential convergence. Computationally, we propose a numerical method based on normalizing flow to realize IKLPD. Conversely, our numerical method can also be viewed as a new approach that sequentially trains a normalizing flow for minimizing a convex functional with a strong theoretical guarantee.

What problem does this paper attempt to address?

The main problem this paper addresses is the minimization of convex functionals over the space of probability measures. Specifically, the authors consider minimizing an L2-convex objective functional \(F\) over the space \(P(\Theta)\) of all probability distributions on \(\Theta\), and propose a new implicit discretization scheme, the Implicit KL Proximal Descent (IKLPD) algorithm. The algorithm is obtained by discretizing the gradient flow relative to the Kullback-Leibler (KL) divergence.

### Main Problems

1. **Non-parametric Maximum Likelihood Estimation (NPMLE)**:
   - This problem arises when estimating the mixing distribution of a mixture model and when using the empirical Bayes method to solve compound decision problems.
   - The goal is to minimize the average negative log-likelihood functional \(L_n(\rho)\), where \(\rho\) is the unknown mixing distribution.
   - The formula is:
     \[
     \hat{P}_n=\arg\min_{\rho\in P(\Theta)}L_n(\rho),\quad\text{with}\quad L_n(\rho):=-\frac{1}{n}\sum_{i=1}^n\log\left(\int_\Theta p(X_i|\theta)\,d\rho(\theta)\right)
     \]
   - \(L_n\) is clearly L2-convex on \(P(\Theta)\), but is usually not displacement-convex.

2. **Bayesian Posterior Sampling**:
   - In Bayesian statistics, the core problem is to sample from the posterior distribution of the unknown parameters in order to estimate the parameters and construct the corresponding confidence intervals.
   - The posterior distribution \(\pi_n\) can be characterized as the minimizer of the KL divergence functional \(D_{\text{KL}}(\cdot\|\pi_n)\).
   - The formula is:
     \[
     \pi_n=\arg\min_{\rho\in P(\Theta)}\int V_n(\theta)\,d\rho(\theta)+\int\rho\log\rho
     \]
     where \(V_n(\theta)=-\log\pi(\theta)-\sum_{i=1}^n\log p(X_i|\theta)\) and \(\pi\) denotes the prior density.

### Paper Contributions

- Proposed the Implicit KL Proximal Descent (IKLPD) algorithm, which discretizes the continuous-time gradient flow relative to the KL divergence, for minimizing a general L2-convex functional \(F\).
- Proved that, under the L2-convexity condition alone, IKLPD converges to a global optimum from any initialization, and that if \(F\) is strongly convex relative to the KL divergence, IKLPD exhibits global exponential convergence.
- Proposed a numerical method based on normalizing flows to implement IKLPD, which trains the normalizing flow layer by layer to minimize the convex functional \(F\).
- Analyzed the convergence of inexact IKLPD with non-zero numerical errors and of a stochastic version of IKLPD.

Together, these results give a theoretically guaranteed method for minimizing convex functionals over the space of probability measures, with NPMLE and Bayesian posterior computation as the motivating applications.
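To make the implicit step concrete for the Bayesian posterior case, here is a minimal sketch on a one-dimensional grid. It rests on assumptions not stated in the summary above: that the IKLPD update takes the usual proximal form \(\rho_{k+1}=\arg\min_{\rho}F(\rho)+\frac{1}{\tau}D_{\text{KL}}(\rho\|\rho_k)\) with step size \(\tau\), and that the distributions are discretized on a grid rather than represented by a normalizing flow as in the paper. Under those assumptions, with \(F=D_{\text{KL}}(\cdot\|\pi_n)\), setting the first-order condition to zero gives the closed form \(\rho_{k+1}\propto\pi_n^{\tau/(1+\tau)}\rho_k^{1/(1+\tau)}\). The names `normalize`, `kl`, and `kl_prox_step` and the toy bimodal target are hypothetical and chosen for illustration only.

```python
import numpy as np

def normalize(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) for strictly positive vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_prox_step(rho, target, tau):
    """One implicit step for F(rho') = D_KL(rho' || target).

    Assumed proximal form: minimize F(rho') + (1/tau) * D_KL(rho' || rho).
    Its minimizer is proportional to target^(tau/(1+tau)) * rho^(1/(1+tau)).
    """
    a = 1.0 / (1.0 + tau)
    return normalize(a * np.log(rho) + (1.0 - a) * np.log(target))

# Hypothetical toy "posterior": a bimodal density on a 1-D grid.
theta = np.linspace(-6.0, 6.0, 241)
target = normalize(np.logaddexp(-0.5 * (theta - 2.0) ** 2,
                                -0.5 * (theta + 2.0) ** 2))

# Broad Gaussian initialization (any strictly positive start works here).
rho = normalize(-0.5 * (theta / 3.0) ** 2)

tau = 1.0
history = [kl(rho, target)]
for _ in range(30):
    rho = kl_prox_step(rho, target, tau)
    history.append(kl(rho, target))

print(history[0], history[5], history[-1])
```

In this toy setting the recorded KL values never increase, because each iterate minimizes the proximal objective and therefore cannot have a larger value of \(F\) than the previous iterate. The closed form is specific to a KL target; for a general functional such as \(L_n\), the proximal subproblem has to be solved numerically.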

Minimizing Convex Functionals over Space of Probability Measures via KL Divergence Gradient Flow

Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Accelerated gradient descent method for functionals of probability measures by new convexity and smoothness based on transport maps

Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

Efficient, multimodal, and derivative-free Bayesian inference with Fisher-Rao gradient flows

Wasserstein gradient flow for optimal probability measure decomposition

Sampling via Gradient Flows in the Space of Probability Measures

Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities

Accelerating optimization over the space of probability measures

Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization

Minimizing $f$-Divergences by Interpolating Velocity Fields

Kernel Approximation of Fisher-Rao Gradient Flows

Large-Scale Wasserstein Gradient Flows

Linear convergence of proximal descent schemes on the Wasserstein space

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization

Deterministic Langevin Unconstrained Optimization with Normalizing Flows

The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program

Mean field approximations via log-concavity