Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Jia-Jie Zhu
2024-11-01
Abstract: Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for the inclusive KL inference, i.e., minimizing $ \mathrm{KL}(\pi \| \mu) $ with respect to $ \mu $ for some target $ \pi $, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The core problem this paper addresses is: **How can a general approximate inference framework for minimizing the inclusive KL divergence be constructed and understood through mathematical analysis, in particular gradient flow theory?** Specifically, the paper focuses on minimizing \( \text{KL}(\pi \| \mu) \) with respect to \(\mu\): given a target distribution \(\pi\), find the distribution \(\mu\) that minimizes this divergence.

### Main contributions of the paper

1. **Reveals a fundamental connection between inclusive KL divergence minimization and methods widely used in sampling, inference, and generative modeling**:
   - In particular, the MMD (Maximum Mean Discrepancy) minimization problem. The paper shows that MMD minimization can approximate inclusive KL minimization through convolution or smoothing, and can therefore be interpreted within the rigorous framework of PDE gradient flow systems.
2. **Extends the study of the Fisher-Rao gradient flow**:
   - Finds that the Fisher-Rao flow can be implemented as an MMD-MMD flow, and discusses the theoretical and practical implications.
3. **Combines the Wasserstein and Fisher-Rao theories to study the Wasserstein-Fisher-Rao gradient flow of the inclusive KL divergence**:
   - Reveals its distinctive properties and its equivalence to existing algorithm implementations.
4. **Provides, for the first time, a gradient flow theoretical basis for inclusive KL inference**:
   - This theory fills an important gap in Bayesian statistics and generative modeling, and offers principled guidance for future research.

### Research background

- **Exclusive KL divergence**: Existing research has mainly focused on minimizing \( \text{KL}(\mu \| \pi) \), which has been analyzed in depth using PDE gradient flow theory and statistical optimal transport theory.
- **Inclusive KL divergence**: In contrast, minimizing \( \text{KL}(\pi \| \mu) \) has received less attention and lacks a solid foundation in mathematical analysis.

### Methodological innovation

- **Gradient flow perspective**: By introducing the Wasserstein-Fisher-Rao gradient flow, the paper provides a unified framework for understanding and analyzing the inclusive KL divergence minimization problem.
- **Application of kernel methods**: By introducing kernel functions (such as the Gaussian kernel), the paper proposes a smoothed gradient flow equation, which addresses the non-smoothness of the original equation and makes it easier to implement and apply.

### Practical applications

- **MMD minimization and Kernel Stein Discrepancy (KSD) minimization**: These methods already perform well in practice, especially in cases where direct sampling from the target distribution is not required.
- **Accelerating and improving MMD minimization tasks**: Introducing the Interaction-Force-Transport (IFT) gradient flow further improves the performance of MMD minimization.

### Summary

This paper not only offers a new perspective on existing inference and sampling algorithms but also provides a solid theoretical basis for the inclusive KL divergence minimization problem, which should support further development in Bayesian inference and generative modeling.
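
To make the MMD minimization discussed above more concrete, the following is a minimal sketch of particle-based MMD descent with a Gaussian kernel, in the spirit of the MMD flow of Arbel et al. (2019). It is not the paper's Wasserstein-Fisher-Rao or IFT algorithm: it only moves particles (a transport step) and never reweights them. The function names, the bandwidth `sigma`, the step size, and the toy Gaussian data are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd_squared(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between sample sets x and y."""
    return (
        gaussian_kernel(x, x, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
    )


def witness_gradient(x, y, sigma=1.0):
    """Gradient, at each particle x_i, of the witness function
    f(z) = mean_j k(z, x_j) - mean_j k(z, y_j)."""

    def grad_mean_kernel(a, b):
        diff = a[:, None, :] - b[None, :, :]           # shape (n, m, d)
        k = gaussian_kernel(a, b, sigma)[:, :, None]   # shape (n, m, 1)
        # grad_z k(z, b_j) = -(z - b_j) / sigma^2 * k(z, b_j), averaged over j
        return (-(diff / sigma**2) * k).mean(axis=1)   # shape (n, d)

    return grad_mean_kernel(x, x) - grad_mean_kernel(x, y)


def mmd_particle_descent(x0, y, step=0.3, n_steps=300, sigma=1.0):
    """Explicit Euler steps x <- x - step * grad f(x); records MMD^2 per step."""
    x = x0.copy()
    history = [mmd_squared(x, y, sigma)]
    for _ in range(n_steps):
        x = x - step * witness_gradient(x, y, sigma)
        history.append(mmd_squared(x, y, sigma))
    return x, history


# Toy data (hypothetical): target samples and initial particles in 2D.
rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, scale=0.5, size=(200, 2))
init = rng.normal(loc=0.0, scale=0.5, size=(100, 2))

particles, history = mmd_particle_descent(init, target)
print(f"MMD^2 before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Note that the sketch needs only samples from the target, not its density. A Fisher-Rao component would additionally update per-particle weights, which is omitted here.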