Implementing Communication-Optimal Parallel and Sequential QR Factorizations

James Demmel, Laura Grigori, Mark Hoemmen, Julien Langou
DOI: https://doi.org/10.48550/arXiv.0809.2407
2008-09-15
Abstract: We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication, and are as stable as Householder QR. The sequential and parallel algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example up to 6.7x over ScaLAPACK. The parallel algorithm for general rectangular matrices is estimated to show significant speedups over ScaLAPACK, up to 22x over ScaLAPACK.
Numerical Analysis
What problem does this paper attempt to address?
The paper sets out to design and implement parallel and sequential dense QR factorization algorithms that minimize communication, both for "tall and skinny" matrices (matrices with many more rows than columns) and for general rectangular matrices. Here, communication means network messages in the parallel case and data movement between levels of the memory hierarchy in the sequential case. The authors also require the algorithms to be as numerically stable as Householder QR, that is, normwise backward stable. With both properties, the new algorithms can outperform existing implementations such as those in LAPACK and ScaLAPACK, in theory and in practice.

### Overview of Main Problems

1. **Minimizing Communication Overhead**: Communication refers to messages sent over the network in parallel computing and to the movement of data between memory levels in sequential computing. Existing implementations such as those in LAPACK and ScaLAPACK incur high communication costs on large matrices, which limits overall performance.
2. **Maintaining Numerical Stability**: The new algorithms must match the numerical stability of Householder QR so that results remain accurate. This matters most for applications with strict accuracy requirements, such as eigenvalue computation.
3. **Improving Computational Efficiency**: By reducing communication, the new algorithms speed up QR factorization in practice. The gains over existing algorithms are largest for "tall and skinny" matrices.
### Specific Application Scenarios

- **Block Iterative Methods**: Methods for solving linear systems \( A x = B \), such as GMRES, QMR, or CG, and block iterative eigenvalue solvers (such as Thick Restart Lanczos and Block Lanczos).
- **Krylov Subspace Methods**: In particular s-step Krylov methods, which improve efficiency by reducing communication.
- **Large-Scale Eigenvalue Calculation**: For large-scale eigenvalue problems, a stable QR factorization is crucial.

### Solutions

The paper proposes two main algorithms:

1. **Tall Skinny QR (TSQR)**: For "tall and skinny" matrices, stored in a one-dimensional block-row layout. TSQR organizes the factorization as a reduction over a tree: each row block is factored locally, and the resulting R factors are combined up the tree. This reduces communication while preserving numerical stability.
2. **Communication-Avoiding QR (CAQR)**: For general rectangular matrices, stored in a two-dimensional block-cyclic layout. CAQR uses TSQR for its panel factorization step, which removes the latency bottleneck in the parallel case and the bandwidth bottleneck in the sequential case.

With these changes, the new algorithms are communication-optimal in theory (ignoring polylogarithmic factors). In practice, TSQR shows significant measured speedups, and parallel CAQR is estimated to show significant speedups over ScaLAPACK.
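The tree-structured reduction behind TSQR can be illustrated with a minimal sketch in Python with NumPy. It assumes a binary reduction tree and simulates the row blocks in a single process; the function name `tsqr` and the parameter `num_blocks` are illustrative and do not come from the paper. A real parallel implementation would typically keep Q in implicit form and exchange only the small R factors between processors, whereas this sketch forms Q explicitly to make the result easy to check.

```python
import numpy as np


def tsqr(A, num_blocks=4):
    """Sketch of TSQR: return (Q, R) with A = Q @ R via a binary reduction tree.

    Each row block stands in for the rows owned by one processor.
    Q is formed explicitly here only for illustration.
    """
    m, n = A.shape
    if m < num_blocks * n:
        raise ValueError("each row block needs at least n rows")

    # 1-D block-row layout: split the rows into contiguous blocks.
    blocks = np.array_split(A, num_blocks, axis=0)

    # Leaves of the tree: an independent local QR of every block.
    nodes = [np.linalg.qr(block) for block in blocks]

    # Reduction: stack two n-by-n R factors, factor the 2n-by-n result,
    # and repeat until a single R factor remains.
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes) - 1, 2):
            (Qa, Ra), (Qb, Rb) = nodes[i], nodes[i + 1]
            Qs, R = np.linalg.qr(np.vstack([Ra, Rb]))
            # Push the small orthogonal factor back into the block Q factors.
            Q = np.vstack([Qa @ Qs[:n], Qb @ Qs[n:]])
            merged.append((Q, R))
        if len(nodes) % 2 == 1:
            # An unpaired node passes through to the next tree level.
            merged.append(nodes[-1])
        nodes = merged

    return nodes[0]


# Example: a tall and skinny matrix split into four row blocks.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))
Q, R = tsqr(A, num_blocks=4)
residual = np.linalg.norm(Q @ R - A) / np.linalg.norm(A)
orthogonality = np.linalg.norm(Q.T @ Q - np.eye(8))
```

In this sketch, only the `n`-by-`n` R factors move between tree levels, which is the source of the communication savings in a distributed setting. The quantities `residual` and `orthogonality` give a quick numerical check of the factorization and of the orthogonality of Q.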