Abstract: Gaussian processes (GPs) are commonly used for geospatial analysis, but they suffer from high computational complexity when dealing with massive data. For instance, evaluating the log-likelihood function required to estimate the statistical model parameters for geospatial data is computationally intensive, because it involves inverting an n × n covariance matrix, where n is the number of geographical locations. As a result, the literature has shifted towards approximation methods that handle larger values of n effectively while maintaining high accuracy. These methods encompass a range of techniques, including low-rank and sparse approximations. The Vecchia approximation is one of the most promising methods for speeding up the evaluation of the log-likelihood function. This study presents a parallel implementation of the Vecchia approximation that uses batched matrix computations on contemporary GPUs. The proposed implementation relies on batched linear algebra routines to efficiently evaluate the individual conditional distributions in the Vecchia algorithm. We rely on the KBLAS linear algebra library to perform the batched linear algebra operations, reducing the time to solution compared to the state-of-the-art parallel implementation of the likelihood estimation operation in the ExaGeoStat software by up to 700×, 833×, and 1380× on 32GB GV100, 80GB A100, and 80GB H100 GPUs, respectively. We also handle larger problem sizes on a single NVIDIA GPU, accommodating up to 1M locations on 80GB A100 and H100 GPUs while maintaining the required application accuracy. We further assess the accuracy of the implemented algorithm and identify the optimal settings for the Vecchia approximation algorithm to preserve accuracy on two real geospatial datasets: soil moisture data in the Mississippi Basin area and wind speed data in the Middle East.
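To make the contrast between the exact log-likelihood and the Vecchia approximation concrete, here is a minimal, sequential pure-Python sketch. It is not the authors' KBLAS/GPU implementation. Several choices are assumptions for illustration only: an exponential (Matérn ν = 1/2) covariance, conditioning sets made of the m nearest *previously ordered* locations, and the hypothetical function names `exact_loglik` and `vecchia_loglik`. The exact version factors the full n × n covariance (O(n³)). The Vecchia version replaces the joint density with a product of n conditional densities, each needing only an (m+1) × (m+1) Cholesky factorization. Because these small factorizations are independent of one another, they are the kind of work a batched GPU routine can execute all at once.

```python
import math


def exp_cov(p, q, sigma2=1.0, rng=0.5):
    """Assumed exponential (Matern nu=1/2) covariance between two 2-D locations."""
    return sigma2 * math.exp(-math.dist(p, q) / rng)


def cholesky(a):
    """Lower-triangular Cholesky factor L with a = L L^T (dense, unblocked)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def exact_loglik(locs, z, cov=exp_cov):
    """Exact Gaussian log-likelihood: one n x n Cholesky, O(n^3) work."""
    n = len(z)
    K = [[cov(locs[i], locs[j]) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    y = forward_solve(L, z)  # z^T K^{-1} z = y^T y
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in y))


def vecchia_loglik(locs, z, m, cov=exp_cov):
    """Vecchia approximation: sum of log p(z_i | up to m nearest earlier z_j).

    Every iteration is independent of the others; a batched GPU code would
    run all of these small Cholesky/triangular solves simultaneously.
    """
    total = 0.0
    for i in range(len(z)):
        # Conditioning set: up to m nearest locations among those ordered before i.
        prev = sorted(range(i), key=lambda j: math.dist(locs[i], locs[j]))[:m]
        idx = prev + [i]  # z_i is placed last in the small joint vector
        K = [[cov(locs[a], locs[b]) for b in idx] for a in idx]
        L = cholesky(K)
        y = forward_solve(L, [z[a] for a in idx])
        # The last Cholesky diagonal is the conditional standard deviation, and
        # y[-1] is the standardized residual (z_i - conditional mean) / sd.
        sd = L[-1][-1]
        total += -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * y[-1] ** 2
    return total


if __name__ == "__main__":
    locs = [(x / 3.0, y / 3.0) for x in range(4) for y in range(4)]
    z = [math.sin(3 * a) + math.cos(2 * b) for a, b in locs]
    print("exact          :", exact_loglik(locs, z))
    print("Vecchia, m = 3 :", vecchia_loglik(locs, z, m=3))
```

When m is at least n − 1, every conditioning set contains all earlier points. In that case the product of conditionals is just the chain rule, and `vecchia_loglik` reproduces `exact_loglik` up to rounding. Smaller m trades accuracy for O(n·m³) work instead of O(n³), which is the accuracy/settings trade-off the abstract refers to. The ordering of locations also affects the approximation; the sketch simply uses the given order.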

Parallel Gaussian process with kernel approximation in CUDA

Implementation and analysis of GPU algorithms for Vecchia Approximation

Generating Approximate Inverse Preconditioners for Sparse Matrices Using CUDA and GPGPU

Parallel cross-validation: A scalable fitting method for Gaussian process models

A parallel implementation of nearest neighbor analysis based on GPGPU

Large-Scale Gaussian Processes via Alternating Projection

Parallel GPU Implementation of Iterative PCA Algorithms

Asynchronous Parallel Large-Scale Gaussian Process Regression

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Nearest Neighbors GParareal: Improving Scalability of Gaussian Processes for Parallel-in-Time Solvers

Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data

GPU-Accelerated Vecchia Approximations of Gaussian Processes for Geospatial Data using Batched Matrix Computations

Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

SKIing on Simplices: Kernel Interpolation on the Permutohedral Lattice for Scalable Gaussian Processes

Scaling Gaussian Process Regression with Derivatives

GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration

A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes

Parallel calculation of the median and order statistics on GPUs with application to robust regression

Parallel optimization for sparse matrix-vector on GPU

Uniform approximation of common Gaussian process kernels using equispaced Fourier grids