Scalable Bayesian Optimization Using Vecchia Approximations of Gaussian Processes

Felix Jimenez, Matthias Katzfuss
DOI: https://doi.org/10.48550/arXiv.2203.01459
2022-03-03
Abstract: Bayesian optimization is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. We adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and selecting multiple input values in parallel. We focus on the use of our warped Vecchia GP in trust-region Bayesian optimization via Thompson sampling. On several test functions and on two reinforcement-learning problems, our methods compared favorably to the state of the art.
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the computational cost of Bayesian optimization (BO) in high-dimensional input spaces and with large numbers of observations. BO relies on Gaussian processes (GPs) as surrogate models, but exact GP inference scales cubically with the number of observations, so standard GP surrogates are hard to scale to high-dimensional problems with many evaluations.

### Main problems:

1. **Optimization in high-dimensional spaces**: When the input dimension is high, optimizing the acquisition function becomes very expensive.
2. **Processing a large amount of observational data**: As the number of observations grows, the computational cost of traditional GP methods increases rapidly, resulting in overly long computation times.

### Solutions:

To solve these problems, the authors adapt the Vecchia approximation, an efficient GP approximation from spatial statistics. The following improvements and extensions make it applicable to BO in high-dimensional spaces and with many observations:

- **Warped kernels**: Use non-linear transformations (such as the Kumaraswamy CDF) to make the GP kernel more flexible.
- **Mini-batch gradient descent**: Train the model by mini-batch gradient descent to reduce the computational burden.
- **Approximate nearest neighbors**: Use approximate nearest-neighbor search to accelerate the computation.
- **Variance correction**: Calibrate the predictive variance to improve uncertainty quantification.

### Experimental results:

The authors evaluated their methods on several test functions and on two reinforcement-learning problems and compared them with existing methods.
The results show that the GP based on the Vecchia approximation often outperforms, or at least matches, other GP surrogate models, including the exact GP. Especially in high-dimensional spaces and with many observations, the Vecchia approximation substantially reduces computational cost while maintaining high optimization accuracy.

### Key formulas:

- Joint density decomposition under the Vecchia approximation:

  \[ \hat{p}(y_{1:n}) = \prod_{i=1}^{n} p(y_i \mid y_{c(i)}) \]

  where \(c(i) \subset \{1, \ldots, i-1\}\) is a conditioning index set of size at most \(m\).

- Vecchia approximation of the posterior predictive distribution:

  \[ \hat{p}(y_p \mid y_{1:n}) = \prod_{i=1}^{n_p} p(y_p^{(i)} \mid y_{cp(i)}) = \mathcal{N}(\mu_p, (L^\top L)^{-1}) \]

  so that \(L^\top L\) is the predictive precision matrix.

Through these improvements, the Vecchia approximation not only makes Bayesian optimization more computationally efficient but also shows potential for complex, high-dimensional optimization problems.
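To make the joint density decomposition concrete, the following is a minimal pure-Python sketch of a Vecchia log-likelihood. It is not the authors' implementation. It assumes a zero-mean GP with a squared-exponential kernel, a fixed noise variance, the given ordering of the points, and exact brute-force nearest-neighbor conditioning sets in the original input space. All function names (`sq_exp_kernel`, `conditioning_set`, `vecchia_loglik`, `exact_loglik`) and the toy data are hypothetical.

```python
import math

def sq_exp_kernel(x, z, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice for this sketch)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return variance * math.exp(-0.5 * d2 / lengthscale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b via Cholesky: forward then backward substitution."""
    L, n = cholesky(A), len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gram(X, idx, noise):
    """Noisy kernel matrix restricted to the indices in idx."""
    return [[sq_exp_kernel(X[a], X[b]) + (noise if a == b else 0.0)
             for b in idx] for a in idx]

def conditioning_set(X, i, m):
    """c(i): up to m nearest points among those ordered before point i."""
    dist2 = lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
    return sorted(range(i), key=dist2)[:m]

def vecchia_loglik(X, y, m, noise=1e-2):
    """Sum over i of log p(y_i | y_c(i)), each term a univariate Gaussian."""
    total = 0.0
    for i in range(len(y)):
        c = conditioning_set(X, i, m)
        mean, var = 0.0, sq_exp_kernel(X[i], X[i]) + noise
        if c:
            k_ci = [sq_exp_kernel(X[j], X[i]) for j in c]
            w = solve_spd(gram(X, c, noise), k_ci)  # kriging weights
            mean = sum(wj * y[j] for wj, j in zip(w, c))
            var -= sum(wj * kj for wj, kj in zip(w, k_ci))
        total += -0.5 * (math.log(2 * math.pi * var) + (y[i] - mean) ** 2 / var)
    return total

def exact_loglik(X, y, noise=1e-2):
    """Exact GP log marginal likelihood, for comparison."""
    n = len(y)
    K = gram(X, list(range(n)), noise)
    logdet = 2.0 * sum(math.log(row[i]) for i, row in enumerate(cholesky(K)))
    quad = sum(a * b for a, b in zip(solve_spd(K, y), y))
    return -0.5 * (quad + logdet + n * math.log(2 * math.pi))

# Toy data (hypothetical): 12 points in 2-D with a smooth response.
X = [[0.1 * i, math.sin(i)] for i in range(12)]
y = [math.sin(3 * x[0]) + 0.5 * x[1] for x in X]

for m in (0, 3, len(y) - 1):
    print(m, vecchia_loglik(X, y, m))
print("exact", exact_loglik(X, y))
```

With `m = n - 1` every point conditions on all earlier points, so the product of conditionals is the chain-rule factorization of the exact joint density and the two log-likelihoods agree up to rounding. With smaller `m`, each term requires only an `m`-by-`m` solve, which is where the savings come from. The sketch leaves out the paper's extensions (warping, mini-batch training, approximate neighbor search, variance correction).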