Oren Mangoubi,Nisheeth K. Vishnoi
Abstract:We consider the problem of sampling from a log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a polytope $K:=\{\theta \in \mathbb{R}^d: A\theta \leq b\}$, where $A\in \mathbb{R}^{m\times d}$ and $b \in \mathbb{R}^m$.The fastest-known algorithm \cite{mangoubi2022faster} for the setting when $f$ is $O(1)$-Lipschitz or $O(1)$-smooth runs in roughly $O(md \times md^{\omega -1})$ arithmetic operations, where the $md^{\omega -1}$ term arises because each Markov chain step requires computing a matrix inversion and determinant (here $\omega \approx 2.37$ is the matrix multiplication constant). We present a nearly-optimal implementation of this Markov chain with per-step complexity which is roughly the number of non-zero entries of $A$ while the number of Markov chain steps remains the same. The key technical ingredients are 1) to show that the matrices that arise in this Dikin walk change slowly, 2) to deploy efficient linear solvers that can leverage this slow change to speed up matrix inversion by using information computed in previous steps, and 3) to speed up the computation of the determinantal term in the Metropolis filter step via a randomized Taylor series-based estimator.
What problem does this paper attempt to address?
### The problem the paper attempts to solve
The paper aims to solve the problem of efficient sampling from a log - concave distribution defined on a polyhedron \(K\). Specifically, given a polyhedron \(K :=\{\theta\in\mathbb{R}^d:A\theta\leq b\}\), where \(A\in\mathbb{R}^{m\times d}\) and \(b\in\mathbb{R}^m\), and a convex function \(f:K\rightarrow\mathbb{R}\), the goal is to sample a point \(\theta\in K\) from the distribution \(\pi(\theta)\propto e^{-f(\theta)}\).
This problem is important in many applications, including Bayesian inference, differential privacy optimization, and integration. For example, in Bayesian Lasso logistic regression, \(f(\theta)=\sum_{i = 1}^n\ell(\theta^{\top}x_i)\), where \(\ell\) is the logistic loss function, \(x_i\) are data points and \(\|x_i\|_2\leq1\), and \(K =\{\theta\in\mathbb{R}^d:\|\theta\|_1\leq O(1)\}\) is an \(\ell_1\)-ball (see references [32, 36, 17]).
### Main contributions
The main contribution of the paper is to propose an almost optimal implementation method, reducing the per - step complexity to approximately the number of non - zero elements \(\text{nnz}(A)\) of the matrix \(A\), while keeping the number of Markov chain steps unchanged. The key technical components include:
1. **Prove that the matrix changes slowly in the Dikin walk**: Show that in each iteration, the matrix in the Dikin walk changes very slowly.
2. **Use an efficient linear solver**: Utilize this slow change to accelerate matrix inversion by using the information calculated in the previous step.
3. **Accelerate the determinant calculation in the Metropolis filtering step**: Accelerate the determinant calculation through an estimator based on the stochastic Taylor series.
These improvements directly improve the running time of sampling from the Gibbs distribution constrained to a polyhedron in applications such as Bayesian statistics and privacy optimization.
### Specific results
According to the main result (Theorem 2.1) of the paper, the proposed algorithm has a significant efficiency improvement in the following cases:
- When \(f\) is \(L\)-Lipschitz or \(\beta\)-smooth, the sampling complexity is respectively:
- \(\tilde{O}((md + dL^2R^2)\log(w/\delta))\times(\text{nnz}(A)+d^2+T_f)\) arithmetic operations for the \(L\)-Lipschitz function \(f\);
- \(\tilde{O}((md + d\beta R^2)\log(w/\delta))\times(\text{nnz}(A)+d^2+T_f)\) arithmetic operations for the \(\beta\)-smooth function \(f\).
Compared with previous algorithms, the method in this paper significantly reduces the running time in many cases, especially when \(T_f=\Omega(d^2)\), the improvement is at least \(d^{\omega - 2}\). If \(A\) is a sparse matrix (i.e., \(\text{nnz}(A)=O(m)\)), the improvement is more obvious, depending on the relationship between \(m\) and \(d\).
### Application examples
In the example of Bayesian Lasso logistic regression, when \(f\) is \(\beta\)-smooth and \(L\)-Lip