Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES
Ichitaro Yamazaki, Andrew J. Higgins, Erik G. Boman, Daniel B. Szyld
2024-02-23
Abstract: On current computer architectures, the performance of GMRES can be limited by the
communication cost of generating orthonormal basis vectors of the Krylov
subspace. To address this performance bottleneck, its $s$-step variant
orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the
communication cost by a factor of $s$. Unfortunately, for a large step size
$s$, the solver can generate extremely ill-conditioned basis vectors, and to
maintain stability in practice, a conservatively small step size is used, which
limits the performance of the $s$-step solver. To improve performance while still
using a small step size, we introduce in this paper a two-stage block
orthogonalization scheme. Similar to the original scheme, the first stage of
the proposed method operates on a block of $s$ basis vectors at a time, but its
objective is to keep the generated basis vectors well-conditioned at a
lower cost. The orthogonalization of the basis vectors is delayed until
the second stage when enough basis vectors are generated to obtain higher
performance.
Our analysis shows that the proposed two-stage scheme is stable. Performance
improves because, although the scheme requires the same amount of computation as
the original scheme, most of the communication is done in the second stage,
which reduces the overall communication requirements.
Our performance results with up to 192 NVIDIA V100 GPUs on the Summit
supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage
approach can reduce the orthogonalization time and the total time-to-solution
by factors of up to $2.6\times$ and $1.6\times$, respectively, over the
original $s$-step GMRES, which had already obtained respective speedups of
$2.1\times$ and $1.8\times$ over the standard GMRES. Similar speedups were
obtained for 3D problems and for matrices from the SuiteSparse Matrix
Collection.
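
The ill-conditioning that motivates a small step size can be seen in a small experiment. The following is a minimal sketch, not the scheme proposed in the paper: it assumes a monomial basis $[v, Av, \dots, A^s v]$ with normalized columns, a dense 2D Laplacian on a tiny 16-by-16 grid, and a random starting vector. The helper names `laplace_2d` and `monomial_block` are hypothetical and exist only for this illustration.

```python
import numpy as np

def laplace_2d(n):
    """Dense 5-point 2D Laplacian on an n-by-n grid (small n only)."""
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)
    return np.kron(I, T) + np.kron(T, I)

def monomial_block(A, v, s):
    """Build s+1 Krylov vectors spanning [v, A v, ..., A^s v].

    Each column is scaled to unit norm, but the columns are NOT
    orthogonalized against each other (assumption for this sketch).
    """
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)
    return V

A = laplace_2d(16)
rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

# 2-norm condition number of the un-orthogonalized block for several step sizes
conds = {s: np.linalg.cond(monomial_block(A, v, s)) for s in (2, 5, 10, 20)}
for s, c in conds.items():
    print(f"s = {s:2d}: cond = {c:.2e}")
```

Because each smaller block is a leading subset of the columns of the larger one, the condition number cannot decrease as $s$ grows. In this sketch it grows quickly with $s$, which is the behavior that forces a conservative step size in practice.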
Numerical Analysis; Distributed, Parallel, and Cluster Computing