Paul Fearnhead,Sebastiano Grazzi,Chris Nemeth,Gareth O. Roberts
Abstract:Recent work has suggested using Monte Carlo methods based on piecewise deterministic Markov processes (PDMPs) to sample from target distributions of interest. PDMPs are non-reversible continuous-time processes endowed with momentum, and hence can mix better than standard reversible MCMC samplers. Furthermore, they can incorporate exact sub-sampling schemes which only require access to a single (randomly selected) data point at each iteration, yet without introducing bias to the algorithm's stationary distribution. However, the range of models for which PDMPs can be used, particularly with sub-sampling, is limited. We propose approximate simulation of PDMPs with sub-sampling for scalable sampling from posterior distributions. The approximation takes the form of an Euler approximation to the true PDMP dynamics, and involves using an estimate of the gradient of the log-posterior based on a data sub-sample. We thus call this class of algorithms stochastic-gradient PDMPs. Importantly, the trajectories of stochastic-gradient PDMPs are continuous and can leverage recent ideas for sampling from measures with continuous and atomic components. We show these methods are easy to implement, present results on their approximation error and demonstrate numerically that this class of algorithms has similar efficiency to, but is more robust than, stochastic gradient Langevin dynamics.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to perform efficient and accurate Bayesian posterior sampling on large - scale datasets. Specifically, traditional Markov Chain Monte Carlo (MCMC) methods are inefficient when dealing with large datasets because each iteration requires calculating the log - posterior density over the entire dataset, which usually grows linearly with the amount of data. To solve this problem, Monte Carlo methods based on Piecewise - Deterministic Markov Processes (PDMPs) have been proposed in recent years. These methods can use subsamples for unbiased estimation, thereby improving computational efficiency. However, the application scope of PDMPs is limited, especially in cases where subsamples need to be used.
For this reason, the paper proposes a new method - Stochastic Gradient PDMPs (SG - PDMPs). By using subsamples and Euler approximation to simulate the dynamics of PDMPs, scalable sampling from the posterior distribution can be achieved. This method not only retains the advantages of PDMPs, such as non - reversibility and continuous trajectories, but also improves computational efficiency by using subsamples of a single data point, thereby achieving more efficient sampling on large - scale datasets.
### Main contributions
1. **Propose SG - PDMPs**: By selecting a random subsample within each time interval and using Euler approximation to simulate the dynamics of PDMPs, scalable sampling from the posterior distribution can be achieved.
2. **Theoretical analysis**: It is proved that the distribution of SG - PDMPs within a fixed time interval is an O(ε) approximation of the true PDMP distribution, and the sources of the approximation error are analyzed.
3. **Numerical experiments**: Through experiments on linear regression and logistic regression models, the advantages of SG - PDMPs over the Stochastic Gradient Langevin Dynamics (SGLD) algorithm are demonstrated, especially in cases of high - dimensionality and large - step - size discretization.
4. **Extended applications**: It is discussed how to apply SG - PDMPs to models containing variable selection, and how to handle discontinuities in the target distribution by introducing "sticky" components.
### Mathematical formulas
- **Target density**:
\[
\pi(x)\propto\exp(-U(x)),\quad U(x)=\sum_{j = 1}^{N}U_j(x)
\]
where \(U_j(x)\) is the potential energy function related to the \(j\)-th data point.
- **Event rate**:
\[
\lambda(z)=\frac{1}{N}\sum_{j = 1}^{N}\lambda_j(z)
\]
where \(\lambda_j(z)\) satisfies:
\[
\lambda_j(x,F_j^x(v))-\lambda_j(x,v)=v\cdot\nabla U_j(x)
\]
- **Reflection operation**:
\[
F_j^x(v)=v-\frac{v\cdot\nabla\hat{U}_j(x)}{\|\nabla\hat{U}_j(x)\|^2}\nabla\hat{U}_j(x)
\]
where \(\hat{U}_j(x)\) is a transformed form of \(U_j(x)\) used to reduce the variance of gradient estimation.
- **Zig - Zag sampler's event rate**:
\[
\lambda_{i,j}(z)=\max(v\cdot\partial_x\hat{U}_j(x),0)
\]
where \(\partial_x\hat{U}_j(x)\) represents the partial derivative of \(\hat{U}_j(x)\) with respect to \(x\).
- **Approximation error**:
\[
\left\|P_T(z,\cdot)-\tilde{P}_T(z,\cdot)\right\|_{\text{TV}}\leq C(z,T)\epsilon
\]
where \(P_T(z,\cdot)\) and \(\tilde{P}_T(z,\cdot)\) are respectively.