Optimal transport natural gradient for statistical manifolds with continuous sample space

Yifan Chen, Wuchen Li
2020-04-16
Abstract: We study the Wasserstein natural gradient in parametric statistical models with continuous sample spaces. Our approach is to pull back the $L^2$-Wasserstein metric tensor in the probability density space to a parameter space, equipping the latter with a positive definite metric tensor, under which it becomes a Riemannian manifold, named the Wasserstein statistical manifold. In general, it is not a totally geodesic sub-manifold of the density space, and therefore its geodesics will differ from the Wasserstein geodesics, except for the well-known Gaussian distribution case, a fact which can also be validated under our framework. We use the sub-manifold geometry to derive a gradient flow and natural gradient descent method in the parameter space. When parametrized densities lie in $\mathbb{R}$, the induced metric tensor admits an explicit formula. In optimization problems, we observe that the natural gradient descent outperforms the standard gradient descent when the Wasserstein distance is the objective function. In such a case, we prove that the resulting algorithm behaves similarly to the Newton method in the asymptotic regime. The proof calculates the exact Hessian formula for the Wasserstein distance, which further motivates another preconditioner for the optimization process. Finally, we present examples to illustrate the effectiveness of the natural gradient in several parametric statistical models, including the Gaussian measure, Gaussian mixture, Gamma distribution, and Laplace distribution.
Optimization and Control, Information Theory, Machine Learning, Statistics Theory
What problem does this paper attempt to address?
This paper sets out to introduce an optimal transport natural gradient for parametric statistical models with continuous sample spaces. Specifically, the authors studied how to use the Wasserstein distance to define a natural gradient in such models and examined how this gradient method performs in optimization problems. By pulling the $L^2$-Wasserstein metric tensor back from the probability density space to the parameter space, they made the parameter space a Riemannian manifold, on which a gradient flow and a natural gradient descent method can be defined.

### Main problems

1. **Introduction of Wasserstein natural gradient**:
   - The authors introduced the Wasserstein natural gradient in parametric statistical models to overcome limitations of the traditional Euclidean gradient descent method in some problems.
   - By pulling the $L^2$-Wasserstein metric tensor back from the probability density space to the parameter space, they constructed a new Riemannian manifold, called the Wasserstein statistical manifold.
2. **Performance improvement in optimization problems**:
   - The authors observed that, in optimization problems with the Wasserstein distance as the objective function, the Wasserstein natural gradient descent method performs better than both the traditional Euclidean gradient descent method and the Fisher-Rao natural gradient descent method.
   - They proved that, in the asymptotic regime, the Wasserstein natural gradient descent method behaves similarly to Newton's method, which further improves optimization efficiency.

### Specific content

- **Wasserstein statistical manifold**:
  - The authors constructed a new Riemannian manifold, the Wasserstein statistical manifold, by pulling the $L^2$-Wasserstein metric tensor back from the probability density space to the parameter space.
  - They defined the metric tensor \( G_W(\theta) \) on the parameter space and, for densities on the real line, derived its explicit form:
    \[
    G_W(\theta) = \int_{\Omega} \frac{1}{\rho(x, \theta)} (\nabla_\theta F(x, \theta))^T \nabla_\theta F(x, \theta) \, dx,
    \]
    where \( F(x, \theta) = \int_{-\infty}^{x} \rho(y, \theta) \, dy \) is the cumulative distribution function.
- **Gradient flow and natural gradient descent**:
  - The authors defined the gradient flow on the Wasserstein statistical manifold and derived the iteration of the natural gradient descent method:
    \[
    \theta_{n + 1} = \theta_n - \tau G_W(\theta_n)^{-1} \nabla_\theta R(\rho(\cdot, \theta_n)),
    \]
    where \( \tau > 0 \) is the step size and \( R \) is the objective function.
  - They proved that, when the Wasserstein distance is the objective function, the Wasserstein natural gradient descent method behaves asymptotically like Newton's method, that is, with \( \theta^* \) denoting the minimizer:
    \[
    \lim_{\theta \to \theta^*} G_W(\theta) = \nabla^2_\theta R(\rho(\cdot, \theta^*)).
    \]
- **Numerical experiments**:
  - The authors tested the Wasserstein natural gradient descent method in numerical experiments and compared it with the traditional Euclidean gradient descent method and the Fisher-Rao natural gradient descent method.
  - The experimental results show that the Wasserstein natural gradient descent method performs better in optimization problems, especially when the Wasserstein distance is the objective function.

### Conclusion

This paper introduced the Wasserstein natural gradient in parametric statistical models and demonstrated its advantages in optimization problems. By constructing the Wasserstein statistical manifold, the authors provided a new tool that can be more...
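
### Illustrative sketch

The metric tensor formula and the natural gradient update listed under "Specific content" can be tried out numerically on a toy problem. The sketch below is not taken from the paper. It assumes a one-dimensional Gaussian family parametrized by \( \theta = (\mu, \log \sigma) \), approximates \( F \) and \( \nabla_\theta F \) with trapezoidal sums and central finite differences, and uses the standard closed form \( W_2^2 = (\mu - \mu^*)^2 + (\sigma - \sigma^*)^2 \) between one-dimensional Gaussians, with the objective taken as \( R = \tfrac{1}{2} W_2^2 \). All function names, the grid, the step size, the starting point and the target are hypothetical choices made for illustration.

```python
import numpy as np


def density(x, theta):
    """Gaussian density; theta = (mu, log_sigma) is an assumed parametrization."""
    mu, sigma = theta[0], np.exp(theta[1])
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def cdf_on_grid(x, theta):
    """Approximate F(x, theta) by cumulative trapezoidal integration of the density."""
    rho = density(x, theta)
    increments = 0.5 * (rho[1:] + rho[:-1]) * np.diff(x)
    return np.concatenate(([0.0], np.cumsum(increments)))


def wasserstein_metric(theta, eps=1e-5, n_grid=4001):
    """Approximate G_W(theta) = int (1/rho) (grad_theta F)^T (grad_theta F) dx."""
    theta = np.asarray(theta, dtype=float)
    mu, sigma = theta[0], np.exp(theta[1])
    # Fixed grid covering mu +/- 6 sigma; the tails beyond it are neglected.
    x = mu + sigma * np.linspace(-6.0, 6.0, n_grid)
    rho = density(x, theta)
    grad_F = np.empty((theta.size, x.size))
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        # Central finite difference of the CDF in the i-th parameter.
        grad_F[i] = (cdf_on_grid(x, theta + step)
                     - cdf_on_grid(x, theta - step)) / (2.0 * eps)
    # integrand[i, j, k] = grad_F[i, k] * grad_F[j, k] / rho[k]
    integrand = grad_F[:, None, :] * grad_F[None, :, :] / rho
    return np.sum(0.5 * (integrand[..., 1:] + integrand[..., :-1]) * np.diff(x), axis=-1)


def objective_grad(theta, target):
    """Gradient of R = 0.5 * ((mu - mu*)^2 + (sigma - sigma*)^2) w.r.t. (mu, log_sigma)."""
    mu, sigma = theta[0], np.exp(theta[1])
    mu_t, sigma_t = target
    return np.array([mu - mu_t, (sigma - sigma_t) * sigma])


def descend(theta0, target, tau, n_steps, natural):
    """Run plain gradient descent, or the natural gradient update if natural=True."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        g = objective_grad(theta, target)
        if natural:
            # theta_{n+1} = theta_n - tau * G_W(theta_n)^{-1} * grad R
            g = np.linalg.solve(wasserstein_metric(theta), g)
        theta = theta - tau * g
    return theta


def w2_error(theta, target):
    """Wasserstein-2 distance between the current Gaussian and the target Gaussian."""
    return np.hypot(theta[0] - target[0], np.exp(theta[1]) - target[1])


target = (0.0, 0.5)                     # (mu*, sigma*), hypothetical target
theta0 = np.array([2.0, np.log(3.0)])   # hypothetical starting point
G0 = wasserstein_metric(theta0)
theta_ngd = descend(theta0, target, tau=0.5, n_steps=20, natural=True)
theta_gd = descend(theta0, target, tau=0.5, n_steps=20, natural=False)

print("G_W at theta0:\n", G0)
print("W2 error, natural gradient:", w2_error(theta_ngd, target))
print("W2 error, plain gradient:  ", w2_error(theta_gd, target))
```

In these coordinates the exact metric is \( \mathrm{diag}(1, \sigma^2) \), so the printed `G0` serves as a check on the quadrature. At the target, the Hessian of \( R \) in the same coordinates is \( \mathrm{diag}(1, (\sigma^*)^2) \), which coincides with \( G_W(\theta^*) \) and matches the Newton-like limit stated above for this toy case. The comparison between the two error values is only a small illustration under the assumed step size and starting point, not a reproduction of the paper's experiments.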