Abstract: On current computer architectures, GMRES' performance can be limited by its communication cost to generate orthonormal basis vectors of the Krylov subspace. To address this performance bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can generate extremely ill-conditioned basis vectors, and to maintain stability in practice, a conservatively small step size is used, which limits the performance of the $s$-step solver. To enhance the performance using a small step size, in this paper, we introduce a two-stage block orthogonalization scheme. Similar to the original scheme, the first stage of the proposed method operates on a block of $s$ basis vectors at a time, but its objective is to maintain the well-conditioning of the generated basis vectors with a lower cost. The orthogonalization of the basis vectors is delayed until the second stage when enough basis vectors are generated to obtain higher performance. Our analysis shows the stability of the proposed two-stage scheme. The performance is improved because while the same amount of computation as the original scheme is required, most of the communication is done at the second stage of the proposed scheme, reducing the overall communication requirements. Our performance results with up to 192 NVIDIA V100 GPUs on the Summit supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage approach can reduce the orthogonalization time and the total time-to-solution by the respective factors of up to $2.6\times$ and $1.6\times$ over the original $s$-step GMRES, which had already obtained the respective speedups of $2.1\times$ and $1.8\times$ over the standard GMRES. Similar speedups were obtained for 3D problems and for matrices from the SuiteSparse Matrix Collection.
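
To make the idea of orthogonalizing a block of $s$ vectors at a time more concrete, the following is a minimal sketch in plain Python. It assumes Cholesky QR as the block kernel, which the abstract does not name; the function names (`gram`, `cholesky_upper`, `chol_qr`) and the small test matrix are hypothetical and chosen only for illustration. The point it tries to show is that all $s$ columns are orthogonalized after forming a single Gram matrix, which in a distributed setting would correspond to one global reduction instead of one or more per vector. It also hints at the stability issue described above: when the block is too ill-conditioned, the Cholesky factorization breaks down.

```python
import math

def gram(V):
    """Gram matrix G = V^T V of a tall matrix V stored as a list of rows.
    In a distributed run this would be the single global reduction (assumption)."""
    s = len(V[0])
    return [[sum(row[i] * row[j] for row in V) for j in range(s)] for i in range(s)]

def cholesky_upper(G):
    """Upper-triangular R with R^T R = G; fails if G is not numerically SPD."""
    s = len(G)
    R = [[0.0] * s for _ in range(s)]
    for i in range(s):
        d = G[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if d <= 0.0:
            raise ValueError("block too ill-conditioned for Cholesky QR")
        R[i][i] = math.sqrt(d)
        for j in range(i + 1, s):
            R[i][j] = (G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    return R

def chol_qr(V):
    """Orthogonalize all s columns of V at once: V = Q R with Q^T Q = I."""
    R = cholesky_upper(gram(V))
    s = len(R)
    Q = []
    for row in V:
        q = [0.0] * s
        for j in range(s):  # solve q R = row by forward substitution
            q[j] = (row[j] - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q, R

# Illustrative block of s = 2 vectors of length 4 (made-up data).
V = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
Q, R = chol_qr(V)
print(gram(Q))  # should be close to the 2x2 identity
```

This is only a sketch of a one-reduction block kernel under the stated assumptions, not the two-stage scheme itself; the first and second stages described in the abstract would apply such block operations at different granularities.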

Preconditioned GMRES Methods with Incomplete Givens Orthogonalization Method for Large Sparse Least-Squares Problems

On IGMRES: an Incomplete Generalized Minimal Residual Method for Large Unsymmetric Linear Systems

Convergence analysis of inner-iteration preconditioned GMRES

Preconditioning Low Rank Generalized Minimal Residual Method (GMRES) for Implicit Discretizations of Matrix Differential Equations

Utilizing CUDA for Preconditioned GMRES Solvers

Right preconditioned GMRES for arbitrary singular systems

Incomplete Factorization Preconditioning for Linear Least Squares Problems

Projection Improved SPAI Preconditioner for FGMRES

Preprocessed GMRES for fast solution of linear equations

Optimal Solutions of Well-Posed Linear Systems via Low-Precision Right-Preconditioned GMRES with Forward and Backward Stabilization

Block-splitting preconditioners for indefinite least squares problem

Learning incomplete factorization preconditioners for GMRES

Improving the Performance of the GMRES Method using Mixed-Precision Techniques

A spectrally preconditioned and initially deflated variant of the restarted block GMRES method for solving multiple right-hand sides linear systems

Graph Neural Preconditioners for Iterative Solutions of Sparse Linear Systems

Hermitian Preconditioning for a class of Non-Hermitian Linear Systems

Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES

New choices of preconditioning matrices for generalized inexact parameterized iterative methods

CIMGS: an Incomplete Orthogonal Factorization Preconditioner

Specifying Gaussian Markov Random Fields with Incomplete Orthogonal Factorization using Givens Rotations

Flexible and deflated variants of the block shifted GMRES method