Abstract: Bayesian optimization is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. We adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and selecting multiple input values in parallel. We focus on the use of our warped Vecchia GP in trust-region Bayesian optimization via Thompson sampling. On several test functions and on two reinforcement-learning problems, our methods compared favorably to the state of the art.

What problem does this paper attempt to address?

This paper addresses the computational cost of Bayesian optimization (BO) in high-dimensional input spaces and with large numbers of observations. BO typically relies on Gaussian processes (GPs) as surrogate models, but exact GP inference becomes prohibitively expensive as the number of observations grows, which makes BO difficult to scale to these settings.

### Main problems:

1. **Optimization in high-dimensional spaces**: When the input dimension is high, optimizing the acquisition function becomes very expensive.
2. **Large numbers of observations**: The cost of exact GP inference grows cubically with the number of observations, leading to long computation times.

### Solutions:

The authors adapt the Vecchia approximation, a GP approximation from spatial statistics, and extend it so that it can be used for BO in high-dimensional spaces with many observations:

- **Warped kernels**: Nonlinear input transformations (such as the Kumaraswamy CDF) make the GP kernel more flexible.
- **Mini-batch gradient descent**: Training with mini-batch gradient descent reduces the computational burden.
- **Approximate nearest neighbors**: Approximate nearest-neighbor search speeds up the neighbor search that the approximation relies on.
- **Variance correction**: Calibrating the predictive variance improves uncertainty quantification.

### Experimental results:

The authors evaluated their method on several test functions and on reinforcement-learning tasks and compared it with existing methods. The results show that GPs based on the Vecchia approximation often outperform, or at least match, other GP surrogate models, including the exact GP. Especially in high-dimensional spaces and with many observations, the Vecchia approximation substantially improves computational efficiency while maintaining optimization accuracy.

### Key formulas:

- Joint density under the Vecchia approximation:
  \[ \hat{p}(y_{1:n}) = \prod_{i=1}^{n} p(y_i \mid y_{c(i)}) \]
  where \(c(i) \subset \{1, \ldots, i-1\}\) is a conditioning index set of size at most \(m\).
- Vecchia approximation of the posterior predictive distribution:
  \[ \hat{p}(y_p \mid y_{1:n}) = \prod_{i=1}^{n_p} p(y_p^{(i)} \mid y_{c_p(i)}) = \mathcal{N}(\mu_p, (L^\top L)^{-1}) \]

With these extensions, the Vecchia approximation improves the computational efficiency of Bayesian optimization and shows potential for complex, high-dimensional optimization problems.
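The joint-density factorization above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a zero-mean GP, a squared-exponential kernel, the ordering in which the points are given, and exact brute-force neighbor search as a stand-in for approximate search. All function names (`kumaraswamy_warp`, `conditioning_sets`, `vecchia_loglik`, `exact_loglik`) and default hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def kumaraswamy_warp(x, a=1.0, b=1.0):
    """Kumaraswamy CDF on [0, 1]: 1 - (1 - x**a)**b. a = b = 1 is the identity."""
    return 1.0 - (1.0 - x**a) ** b

def sq_exp_kernel(A, B, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def conditioning_sets(X, m):
    """c(i): indices of at most m nearest points among those ordered before i.

    Brute-force search; an approximate nearest-neighbor index could replace it.
    """
    sets = [np.array([], dtype=int)]
    for i in range(1, len(X)):
        dist = np.linalg.norm(X[:i] - X[i], axis=1)
        sets.append(np.argsort(dist)[:m])
    return sets

def vecchia_loglik(y, X, m, noise=1e-2, **kern):
    """Sum over i of log p(y_i | y_c(i)) for a zero-mean GP with Gaussian noise."""
    total = 0.0
    for i, c in enumerate(conditioning_sets(X, m)):
        k_ii = sq_exp_kernel(X[i:i + 1], X[i:i + 1], **kern)[0, 0] + noise
        if len(c) == 0:
            mean, var = 0.0, k_ii  # first point: marginal density
        else:
            K_cc = sq_exp_kernel(X[c], X[c], **kern) + noise * np.eye(len(c))
            k_ci = sq_exp_kernel(X[c], X[i:i + 1], **kern)[:, 0]
            w = np.linalg.solve(K_cc, k_ci)
            mean = w @ y[c]          # conditional mean of y_i given y_c(i)
            var = k_ii - w @ k_ci    # conditional variance
        total += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

def exact_loglik(y, X, noise=1e-2, **kern):
    """Exact zero-mean GP log marginal likelihood via a dense Cholesky factor."""
    K = sq_exp_kernel(X, X, **kern) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 5))
    y = np.sin(3 * X[:, 0]) + X[:, 1]
    Xw = kumaraswamy_warp(X, a=1.5, b=0.8)  # warp inputs before the kernel
    print("Vecchia (m=10):", vecchia_loglik(y, Xw, m=10))
    print("Exact GP      :", exact_loglik(y, Xw))
```

Each conditional only requires solving a system of size at most \(m\), so the cost grows linearly in \(n\) for fixed \(m\) once the conditioning sets are known. Setting \(m = n - 1\) recovers the exact log-likelihood by the chain rule, which makes a convenient sanity check for a sketch like this one.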

Scalable Bayesian Optimization Using Vecchia Approximations of Gaussian Processes
