Abstract: Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for inclusive KL inference, i.e., minimizing $ \mathrm{KL}(\pi \| \mu) $ with respect to $ \mu $ for some target $ \pi $, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.
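
For reference, the two divergences contrasted in the abstract can be written out explicitly. This is the standard formulation, assuming $\mu$ and $\pi$ are mutually absolutely continuous; it is not quoted from the paper.

$$
\mathrm{KL}(\mu \| \pi) = \int \log\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\,\mathrm{d}\mu \quad \text{(exclusive)},
\qquad
\mathrm{KL}(\pi \| \mu) = \int \log\frac{\mathrm{d}\pi}{\mathrm{d}\mu}\,\mathrm{d}\pi \quad \text{(inclusive)}.
$$

As a formal illustration only, write $F(\mu) = \mathrm{KL}(\pi \| \mu)$. Its first variation is $\frac{\delta F}{\delta \mu} = -\frac{\mathrm{d}\pi}{\mathrm{d}\mu}$ up to an additive constant. Under one common mass-preserving normalization of the Fisher-Rao geometry (an assumption here; scaling conventions differ between references), the gradient flow then reads

$$
\partial_t \mu_t = -\mu_t \left( \frac{\delta F}{\delta \mu}[\mu_t] - \int \frac{\delta F}{\delta \mu}[\mu_t]\,\mathrm{d}\mu_t \right) = \pi - \mu_t,
\qquad
\mu_t = \pi + e^{-t}\,(\mu_0 - \pi).
$$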

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is: **How to construct and understand the general approximate inference framework for minimizing the inclusive KL divergence from the perspective of mathematical analysis, especially by using the gradient flow theory?** Specifically, the paper focuses on minimizing $ \text{KL}(\pi \| \mu) $, that is, given the target distribution $\pi$, finding the optimal distribution $\mu$ such that the KL divergence between the two is minimized.

### Main contributions of the paper

1. **Revealed the fundamental connection between inclusive KL divergence minimization and methods widely used in sampling, inference, and generative models**:
   - In particular, the MMD (Maximum Mean Discrepancy) minimization problem. The paper shows that the latter can approximate the former through convolution or smoothing, and thus can be interpreted within the strict framework of the PDE gradient flow system.
2. **Extended the research on Fisher-Rao gradient flow**:
   - Discovered that the Fisher-Rao flow can be implemented as an MMD-MMD flow, and provided theoretical and practical implications.
3. **Combined Wasserstein and Fisher-Rao theories to study the Wasserstein-Fisher-Rao gradient flow of inclusive KL divergence**:
   - Revealed its unique properties and its equivalence to existing algorithm implementations.
4. **Provided the gradient flow theoretical basis for inclusive KL inference for the first time**:
   - This theory fills an important gap in the fields of Bayesian statistics and generative modeling, and provides principled guidance for future research.

### Research background

- **Exclusive KL divergence**: Existing research has mainly focused on minimizing $ \text{KL}(\mu \| \pi) $ and has carried out in-depth analysis using PDE gradient flow theory and statistical optimal transport theory.
- **Inclusive KL divergence**: In contrast, there is less research on minimizing $ \text{KL}(\pi \| \mu) $, and it lacks a solid mathematical analysis foundation.

### Methodological innovation

- **Gradient flow perspective**: The paper provides a unified framework to understand and analyze the inclusive KL divergence minimization problem by introducing the Wasserstein-Fisher-Rao gradient flow.
- **Application of kernel methods**: By introducing kernel functions (such as the Gaussian kernel), the paper proposes a smoothed gradient flow equation, which solves the non-smooth problem in the original equation and makes it easier to implement and apply.

### Practical applications

- **MMD minimization and Kernel Stein Discrepancy (KSD) minimization**: These methods have already performed excellently in practical applications, especially in cases where direct sampling from the target distribution is not required.
- **Accelerating and improving MMD minimization tasks**: By introducing the Interaction-Force-Transport (IFT) gradient flow, the performance of MMD minimization is further enhanced.

### Summary

This paper not only proposes a new perspective to understand existing inference and sampling algorithms but also provides a solid theoretical basis for the inclusive KL divergence minimization problem. This will help promote the further development of the fields of Bayesian inference and generative modeling.
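
To make the MMD minimization discussed above more concrete, here is a minimal particle sketch of a plain MMD gradient flow with a Gaussian kernel, in the spirit of Arbel et al. (2019). It is not the paper's IFT or Wasserstein-Fisher-Rao scheme. The function names, the bandwidth `sigma`, the step size, and the toy Gaussian data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma):
    """Biased (V-statistic) estimate of MMD^2 between sample sets x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

def witness_gradient(x, y, sigma):
    """Gradient of the witness f = mean_j k(., x_j) - mean_j k(., y_j) at each x_i."""
    def grad_mean_embedding(z, pts):
        diff = z[:, None, :] - pts[None, :, :]           # shape (n, m, d)
        k = gaussian_kernel(z, pts, sigma)[:, :, None]   # shape (n, m, 1)
        return (-(diff / sigma ** 2) * k).mean(axis=1)   # shape (n, d)
    return grad_mean_embedding(x, x) - grad_mean_embedding(x, y)

def mmd_flow(x0, y, sigma=1.0, step=0.2, n_steps=1000):
    """Explicit Euler discretization of dx_i/dt = -grad f(x_i) (hypothetical helper)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * witness_gradient(x, y, sigma)
    return x

# Toy data (assumed for illustration): particles start away from the target samples.
rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, scale=0.5, size=(200, 2))
init = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
final = mmd_flow(init, target)

print("MMD^2 before:", mmd_squared(init, target, 1.0))
print("MMD^2 after: ", mmd_squared(final, target, 1.0))
```

The update only needs samples from the target, not its density or score. It moves particles along the negative gradient of the witness function, so particles are attracted to target samples and repelled from one another.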

Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Minimizing Convex Functionals over Space of Probability Measures via KL Divergence Gradient Flow

Kernel Approximation of Fisher-Rao Gradient Flows

Large-Scale Wasserstein Gradient Flows

Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities

Sampling via Gradient Flows in the Space of Probability Measures

Bridging the Gap Between Variational Inference and Wasserstein Gradient Flows

Wasserstein gradient flow for optimal probability measure decomposition

Efficient, multimodal, and derivative-free Bayesian inference with Fisher-Rao gradient flows

Sequential Monte Carlo for Inclusive KL Minimization in Amortized Variational Inference

Wasserstein Gradient Flow over Variational Parameter Space for Variational Inference

Learning with minibatch Wasserstein: asymptotic and gradient properties

Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization

Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

Iterated Schrödinger bridge approximation to Wasserstein Gradient Flows

A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference

Solving Fredholm Integral Equations of the First Kind via Wasserstein Gradient Flows

A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization

Mean-field Variational Inference via Wasserstein Gradient Flow