A stable one-synchronization variant of reorthogonalized block classical Gram--Schmidt

Erin Carson, Yuxin Ma
2024-11-11
Abstract: The block classical Gram--Schmidt (BCGS) algorithm and its reorthogonalized variant are widely used methods for computing the economic QR factorization of block columns $X$ due to their lower communication cost compared to other approaches such as modified Gram--Schmidt and Householder QR. To further reduce communication, i.e., synchronization, there has been a long ongoing search for a reorthogonalized BCGS variant that achieves $O(u)$ loss of orthogonality while requiring only \emph{one} synchronization point per block column, where $u$ represents the unit roundoff. Utilizing Pythagorean inner products and delayed normalization techniques, we propose the first provably stable one-synchronization reorthogonalized BCGS variant, demonstrating that it has $O(u)$ loss of orthogonality under the condition $O(u) \kappa^2(X) \leq 1/2$, where $\kappa(\cdot)$ represents the condition number. By incorporating one additional synchronization point, we develop a two-synchronization reorthogonalized BCGS variant which maintains $O(u)$ loss of orthogonality under the improved condition $O(u) \kappa(X) \leq 1/2$. An adaptive strategy is then proposed to combine these two variants, ensuring $O(u)$ loss of orthogonality while using as few synchronization points as possible under the less restrictive condition $O(u) \kappa(X) \leq 1/2$. As an example of where this adaptive approach is beneficial, we show that using the adaptive orthogonalization variant, $s$-step GMRES achieves a backward error comparable to $s$-step GMRES with BCGSI+, also known as BCGS2, both theoretically and numerically, but requires fewer synchronization points.
Numerical Analysis
What problem does this paper attempt to address?
The main problem this paper addresses is how to reduce the number of synchronization points in the block classical Gram-Schmidt (BCGS) algorithm while maintaining numerical stability. Specifically, the authors propose a new one-sync reorthogonalized BCGS variant and prove that, under certain conditions, it achieves the same \(O(u)\) loss of orthogonality (LOO) as existing methods, where \(u\) represents the unit roundoff.

### Main Problems and Goals

1. **Reduce Synchronization Points**: The traditional BCGS algorithm requires two synchronization points for each block column processed, while reorthogonalized variants (such as BCGSI+) require four. Each synchronization point adds communication overhead and reduces performance, so researchers have been looking for a method that needs only one or two synchronization points per block column.
2. **Maintain Numerical Stability**: Reducing synchronization points improves performance, but it must not compromise the numerical stability of the algorithm. In particular, the LOO should remain at the \(O(u)\) level.
3. **Adapt to Matrices with Different Condition Numbers**: For matrices with large condition numbers, some existing low-synchronization methods may not guarantee \(O(u)\) LOO. An adaptive strategy is therefore needed that uses the one-sync method when the condition number is small and switches to the two-sync method when it is large.

### Solutions

1. **One-Sync Reorthogonalized BCGS (BCGSI+P-1S)**:
   - Using Pythagorean inner products and delayed normalization, a new one-sync reorthogonalized BCGS variant is proposed.
   - It is proved that under the condition \(O(u)\kappa^2(X)\leq 1/2\), this method achieves \(O(u)\) LOO.
2. **Two-Sync Reorthogonalized BCGS (BCGSI+P-2S)**:
   - One additional synchronization point is introduced, so that the algorithm maintains \(O(u)\) LOO under the more relaxed condition \(O(u)\kappa(X)\leq 1/2\).
   - A more stable intra-block orthogonalization method (such as Householder QR or TSQR) replaces the original Pythagorean-based Cholesky QR.
3. **Adaptive Orthogonalization Strategy**:
   - An adaptive method is proposed that mainly uses BCGSI+P-1S and switches to BCGSI+P-2S when the condition number exceeds a certain threshold.
   - The switch is triggered by checking whether \(O(u)\kappa^2(X_k)>1\) or by verifying whether \(U_k\) is well-conditioned.

### Summary

This paper addresses the performance bottleneck caused by the many synchronization points in the traditional BCGS algorithm. It introduces new one-sync and two-sync reorthogonalized BCGS variants and an adaptive orthogonalization strategy that combines them, while preserving numerical stability. These improvements matter for the efficiency and stability of Krylov subspace methods (such as \(s\)-step GMRES) in large-scale parallel computing environments.
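
### Illustrative sketch: the Pythagorean inner product idea

To make the "Pythagorean inner product" ingredient mentioned above more concrete, the following is a minimal NumPy sketch of a single-pass block Gram-Schmidt step that needs only one fused inner product per block column. It is not the paper's BCGSI+P-1S algorithm, which additionally uses reorthogonalization and delayed normalization; it only illustrates the underlying identity. If \(Q\) has orthonormal columns and \(S = Q^T X_k\), then \((X_k - QS)^T (X_k - QS) = X_k^T X_k - S^T S\), so the triangular factor of the projected block can be obtained from a Cholesky factorization of quantities computed in one reduction. The function names `bcgs_pip` and `loss_of_orthogonality`, the block size `s`, and the test matrix are assumptions made for this example.

```python
import numpy as np


def loss_of_orthogonality(Q):
    """Return ||I - Q^T Q||_2, the usual loss-of-orthogonality measure."""
    n = Q.shape[1]
    return np.linalg.norm(np.eye(n) - Q.T @ Q, 2)


def bcgs_pip(X, s):
    """Single-pass block Gram-Schmidt using Pythagorean inner products.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    X is m-by-n with n divisible by s; returns Q (m-by-n) and upper
    triangular R (n-by-n) with X ~= Q @ R.
    """
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(0, n, s):
        Xk = X[:, k:k + s]
        # One fused product [Q_prev, Xk]^T Xk. In a distributed setting this
        # would be a single global reduction, i.e. one synchronization point.
        G = np.hstack([Q[:, :k], Xk]).T @ Xk
        S, Omega = G[:k, :], G[k:, :]
        # Pythagorean identity: Gram matrix of the projected block is
        # Omega - S^T S, so its Cholesky factor is the diagonal block of R.
        # This fails (LinAlgError) if the block is too ill-conditioned.
        Rkk = np.linalg.cholesky(Omega - S.T @ S).T  # upper triangular
        W = Xk - Q[:, :k] @ S
        # Q_k = W @ inv(Rkk), computed via a solve with Rkk^T.
        Q[:, k:k + s] = np.linalg.solve(Rkk.T, W.T).T
        R[:k, k:k + s] = S
        R[k:k + s, k:k + s] = Rkk
    return Q, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 12))  # well-conditioned test matrix
    Q, R = bcgs_pip(X, s=4)
    print("loss of orthogonality:", loss_of_orthogonality(Q))
    print("relative residual:", np.linalg.norm(X - Q @ R) / np.linalg.norm(X))
```

Because the Cholesky step works with a Gram matrix, this simple form is sensitive to the square of the condition number, which is consistent with the \(O(u)\kappa^2(X)\leq 1/2\) type of condition discussed above. Replacing the Cholesky-based step with Householder QR or TSQR on the projected block, at the cost of an extra synchronization, is the kind of trade-off the two-sync variant makes.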