Geodesic Sinkhorn for Fast and Accurate Optimal Transport on Manifolds

Guillaume Huguet, Alexander Tong, María Ramos Zapatero, Christopher J. Tape, Guy Wolf, Smita Krishnaswamy
2023-09-26
Abstract: Efficient computation of optimal transport distance between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state-of-the-art for such computations, but require $O(n^2)$ computations. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold-structured scientific data, it is often desirable to consider geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn -- based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O(n\log n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high-dimensional single-cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
This paper aims to make the Optimal Transport (OT) distance on manifolds faster and more accurate to compute. The standard Sinkhorn method performs well for OT distances, but it has two main limitations:

1. **High computational complexity**: Standard Sinkhorn requires \(O(n^2)\) computation, which is slow for large datasets.
2. **Limitations of Euclidean distance**: Existing methods usually use the Euclidean distance as the ground distance. When high-dimensional data is assumed to lie on a low-dimensional manifold, the Euclidean distance may not be the best choice.

To address both, the authors propose the Geodesic Sinkhorn method. It defines a geodesic ground distance through heat kernel diffusion on a graph and approximates the heat kernel with Chebyshev polynomials of the sparse graph Laplacian, which reduces the computational cost to \(O(n \log n)\). Geodesic Sinkhorn also better captures the intrinsic geometric structure of the data, which improves the accuracy of the OT distance.

### Main contributions

1. **Efficient calculation of OT distance**: A new method, Geodesic Sinkhorn, computes the OT distance on manifolds quickly and accurately, and is efficient in both time and memory.
2. **Definition of barycentric distance**: A new distance between families of distributions, defined as the distance between their barycenters, is introduced and shown to be useful for tasks such as analyzing treatment effects.

### Application scenarios

The method is applied to single-cell data, in particular to evaluating the effect of chemotherapy on patient-derived cancer organoids (PDOs). Comparing the barycenters under different treatments quantifies the treatment effect more accurately and helps identify whether a drug combination has a synergistic effect.

### Summary

Geodesic Sinkhorn addresses the computational cost and ground-distance limitations of the standard Sinkhorn method by combining heat kernel diffusion with a Chebyshev polynomial approximation, giving a more efficient and accurate way to compute OT distances for large, high-dimensional datasets.
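The core idea described above, Sinkhorn iterations in which the kernel is a graph heat kernel applied through a Chebyshev expansion, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `heat_kernel_apply` and `sinkhorn_scalings`, the toy cycle graph, the diffusion time `t`, and the polynomial order are all hypothetical choices made for the example. The Laplacian is a small dense NumPy array here; in practice a sparse matrix would take its place, since the code only ever multiplies by `L`.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebinterpolate


def heat_kernel_apply(L, v, t, lmax, order=30):
    """Approximate exp(-t L) @ v with a Chebyshev expansion (order >= 1).

    L is a symmetric graph Laplacian with eigenvalues in [0, lmax].
    Only products L @ u are used, so the kernel matrix is never formed.
    """
    # Chebyshev coefficients of y -> exp(-t * x), with x = lmax * (y + 1) / 2
    # mapping y in [-1, 1] onto the spectrum interval [0, lmax].
    coeffs = chebinterpolate(lambda y: np.exp(-t * lmax * (y + 1.0) / 2.0), order)

    def shifted(u):
        # Applies (2 L / lmax - I), whose spectrum lies in [-1, 1].
        return (2.0 / lmax) * (L @ u) - u

    t_prev, t_curr = v, shifted(v)  # T_0(.) v and T_1(.) v
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        # Three-term recurrence: T_{k+1} = 2 y T_k - T_{k-1}.
        t_prev, t_curr = t_curr, 2.0 * shifted(t_curr) - t_prev
        out = out + c * t_curr
    return out


def sinkhorn_scalings(L, mu, nu, t, lmax, n_iter=200):
    """Sinkhorn scaling vectors (u, v) with the heat kernel as the kernel."""
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / heat_kernel_apply(L, v, t, lmax)
        # The heat kernel is symmetric, so the same routine serves both updates.
        v = nu / heat_kernel_apply(L, u, t, lmax)
    return u, v


# Toy example (assumption): unnormalized Laplacian of a 6-node cycle graph.
n = 6
A = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
L = np.diag(A.sum(axis=1)) - A
lmax = 2.0 * A.sum(axis=1).max()  # 2 * max degree bounds the largest eigenvalue
t = 1.0                           # diffusion time, chosen arbitrarily here

mu = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
nu = mu[::-1].copy()

u, v = sinkhorn_scalings(L, mu, nu, t, lmax)
row_marginal = u * heat_kernel_apply(L, v, t, lmax)  # should match mu
col_marginal = v * heat_kernel_apply(L, u, t, lmax)  # should match nu
```

Each iteration costs a fixed number of Laplacian products, which is where the savings over a dense \(O(n^2)\) kernel come from when `L` is sparse. The sketch assumes strictly positive marginals and a diffusion time large enough that the approximated kernel stays positive; very small `t` on large graphs would need extra care.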