Parallel cross-validation: A scalable fitting method for Gaussian process models

Florian Gerber, Douglas W. Nychka
DOI: https://doi.org/10.1016/j.csda.2020.107113
2021-03-01
Abstract: Gaussian process (GP) models are widely used to analyze spatially referenced data and to predict values at locations without observations. They are based on a statistical framework, which enables uncertainty quantification of the model structure and predictions. Both the evaluation of the likelihood and the prediction involve solving linear systems. Hence, the computational costs are large and limit the amount of data that can be handled. While there are many approximation strategies that lower the computational cost of GP models, they often provide sub-optimal support for the parallel computing capabilities of (high-performance) computing environments. To bridge this gap, a parallelizable parameter estimation and prediction method is presented. The key idea is to divide the spatial domain into overlapping subsets and to use cross-validation (CV) to estimate the covariance parameters in parallel. Although simulations show that CV is less effective for parameter estimation than the maximum likelihood method, it is amenable to parallel computing and enables the handling of large datasets. Exploiting the screen effect for spatial prediction helps to arrive at a spatial analysis that is close to a global computation despite performing parallel computations on local regions. Simulation studies assess the accuracy of the parameter estimates and predictions. The implementation shows good weak and strong parallel scaling properties. For illustration, an exponential covariance model is fitted to a scientifically relevant canopy height dataset with 5 million observations. Using 512 processor cores in parallel brings the evaluation time of one covariance parameter configuration to 1.5 minutes.
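The abstract gives only the key idea (overlapping spatial subsets, CV-based estimation of covariance parameters, parallel evaluation), not the algorithm itself. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. Assumed for illustration: a unit-square domain split into a regular grid of cells, a fixed buffer width `overlap`, a known nugget, a squared-error leave-one-out score, a small grid of candidate range values, and a thread pool standing in for a high-performance computing backend. All function and variable names are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def exp_cov(a, b, range_, sill=1.0):
    """Exponential covariance sill * exp(-d / range_) between point sets a and b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-d / range_)


def make_subsets(locs, n_side=3, overlap=0.1):
    """Split the unit square into n_side x n_side cells (assumed layout).
    Each subset is a cell enlarged by `overlap`; `core` flags the points
    that lie inside the cell itself rather than in the buffer."""
    edges = np.linspace(0.0, 1.0, n_side + 1)
    subsets = []
    for i in range(n_side):
        for j in range(n_side):
            lo = np.array([edges[i], edges[j]])
            hi = np.array([edges[i + 1], edges[j + 1]])
            in_buf = np.all((locs >= lo - overlap) & (locs <= hi + overlap), axis=1)
            in_core = np.all((locs >= lo) & (locs < hi), axis=1)
            idx = np.flatnonzero(in_buf)
            subsets.append((idx, in_core[idx]))
    return subsets


def subset_cv_score(locs, y, idx, core, range_, nugget=0.05):
    """Sum of squared leave-one-out errors over the core points of one subset.
    Uses the closed-form zero-mean GP identity:
    LOO residual_i = (K^-1 y)_i / (K^-1)_ii."""
    K = exp_cov(locs[idx], locs[idx], range_) + nugget * np.eye(len(idx))
    K_inv = np.linalg.inv(K)
    loo_resid = (K_inv @ y[idx]) / np.diag(K_inv)
    return float(np.sum(loo_resid[core] ** 2))


def cv_estimate_range(locs, y, candidates, subsets, workers=4):
    """Score every candidate range; the per-subset scores are independent,
    so they are evaluated concurrently and then summed."""
    scores = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for r in candidates:
            parts = pool.map(
                lambda s: subset_cv_score(locs, y, s[0], s[1], r), subsets
            )
            scores.append(sum(parts))
    scores = np.array(scores)
    return candidates[int(np.argmin(scores))], scores


# Simulated data: 600 random locations, exponential covariance plus nugget.
rng = np.random.default_rng(0)
locs = rng.random((600, 2))
K_true = exp_cov(locs, locs, range_=0.2) + 0.05 * np.eye(len(locs))
y = np.linalg.cholesky(K_true) @ rng.standard_normal(len(locs))

candidates = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
subsets = make_subsets(locs)
best_range, scores = cv_estimate_range(locs, y, candidates, subsets)
print("CV-selected range:", best_range)
```

Each point is scored exactly once, in the subset whose core cell contains it, while the buffer supplies nearby neighbours for the prediction. This is one plausible way to use the screen effect the abstract mentions. A squared-error score cannot identify the overall variance, so this sketch tunes only the range parameter.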
statistics & probability; computer science, interdisciplinary applications