Cascading GEMM: High Precision from Low Precision

Devangi N. Parikh, Robert A. van de Geijn, Greg M. Henry
2023-03-08
Abstract: This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) from (in terms of) lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. With this, it is shown how approximate FP64x2 GEMM accuracy can be cast in terms of ten "cascading" FP64 GEMMs. Promising results from preliminary performance and accuracy experiments are reported. The demonstrated techniques open up new research directions for more general cascading of higher-precision computation in terms of lower-precision computation for GEMM-like functionality.
Mathematical Software
What problem does this paper attempt to address?
### Problem Addressed by the Paper

This paper explores how to obtain higher-precision GEMM (e.g., double-double precision, FP64x2) from lower-precision, high-performance matrix-matrix multiplication (e.g., double precision, FP64). Specifically, the authors show how to approximate FP64x2 GEMM by decomposing each double-double precision matrix into several double precision matrices and multiplying those.

### Main Contributions

1. **Strategy**: A method is proposed to decompose FP64x2 matrices into multiple FP64 matrices, so that FP64x2 GEMM can be approximated with ten FP64 GEMMs.
2. **Prototype Implementation**: A prototype demonstrates that the proposed scheme can achieve both high performance and high precision, while:
   - Using a workspace similar to that of high-performance FP64 GEMM implementations,
   - Leveraging existing FP64 GEMM kernels for portable high performance,
   - Performing all high-order O(n^3) work, and most of the O(n^2) work, using only FP64 arithmetic.
3. **New Opportunities**: New research directions are discussed, including:
   - How to more generally support high-precision GEMM in terms of low-precision GEMM,
   - Opportunities for leveraging GPUs,
   - The need for in-depth numerical analysis,
   - The benefits of scaling/balancing methods,
   - The potential for hardware support,
   - How to support other level-3 BLAS functions.

### Background

- **Precision Issues**: Single precision (FP32) and double precision (FP64) are typically supported in hardware, but higher precision (e.g., quadruple precision, FP128) is less widely supported because hardware area grows quadratically with the size of the mantissa.
- **Double-Double Precision (FP64x2)**: Extends precision by storing each number as two FP64 numbers, which provides more mantissa bits without increasing the exponent range. Although it is slower than FP64 arithmetic, it is simpler and faster than software-implemented FP128.
- **Prior Work**: Earlier research focused on achieving higher-precision GEMM or dot products through low-precision computation, but the input matrices were typically stored in low precision.

### Methodology

- **Decomposition**: Decompose each FP64x2 matrix into four FP64 matrices, each representing a part of the original matrix.
- **Computation**: Perform FP64 GEMMs on these components and accumulate the results at the appropriate precision.
- **Error Analysis**: Give a preliminary analysis of the errors introduced when FP64x2 GEMM is decomposed into FP64 GEMMs, in particular how transforming the matrices affects those errors.

### Conclusion

By decomposing a high-precision GEMM into multiple low-precision GEMMs, both high performance and high precision can be achieved. This approach advances the understanding of how to implement FP64x2 GEMM and opens many new directions for future research.
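
The decomposition and computation steps summarized above can be illustrated with a small sketch. This is not the authors' implementation, and the paper's actual splitting scheme may differ. It assumes matrix entries in (-1, 1), a hypothetical slice width `BITS = 24` chosen so that each small FP64 product sums exactly, and exact rational accumulation (`fractions.Fraction`) as a stand-in for FP64x2 accumulation. The helper names `split_matrix`, `gemm_fp64`, and `cascading_gemm` are invented for illustration. With four components per operand, keeping only the products A_i B_j with i + j <= 3 is one way to arrive at ten FP64 GEMMs, which matches the count given in the abstract.

```python
from fractions import Fraction
import math

BITS = 24    # hypothetical slice width (an assumption, not taken from the paper)
PIECES = 4   # four FP64 components per high-precision matrix


def split_matrix(M):
    """Split a matrix of exact values (|x| < 1) into PIECES FP64 matrices."""
    rows, cols = len(M), len(M[0])
    parts = [[[0.0] * cols for _ in range(rows)] for _ in range(PIECES)]
    for r in range(rows):
        for c in range(cols):
            rest = Fraction(M[r][c])
            for k in range(PIECES):
                scale = 2 ** (BITS * (k + 1))
                piece = Fraction(math.floor(rest * scale), scale)
                # The conversion is exact: each slice has at most BITS + 1 bits.
                parts[k][r][c] = float(piece)
                rest -= piece
    return parts


def gemm_fp64(A, B):
    """Plain FP64 matrix product, standing in for a tuned FP64 GEMM kernel."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def cascading_gemm(A, B):
    """Approximate the exact product using only FP64 GEMMs on the slices."""
    A_parts, B_parts = split_matrix(A), split_matrix(B)
    n, m = len(A), len(B[0])
    C = [[Fraction(0)] * m for _ in range(n)]
    count = 0
    for i in range(PIECES):
        for j in range(PIECES - i):      # keep only pairs with i + j <= 3
            P = gemm_fp64(A_parts[i], B_parts[j])
            count += 1
            for r in range(n):
                for c in range(m):
                    C[r][c] += Fraction(P[r][c])  # higher-precision accumulation
    return C, count


A = [[Fraction(1, 3), Fraction(-2, 7)], [Fraction(5, 11), Fraction(1, 13)]]
B = [[Fraction(3, 17), Fraction(4, 19)], [Fraction(-6, 23), Fraction(7, 29)]]
exact = [[sum(A[i][p] * B[p][j] for p in range(2)) for j in range(2)]
         for i in range(2)]

C, num_gemms = cascading_gemm(A, B)

# Baseline: round the inputs to FP64 once and do a single FP64 GEMM.
A64 = [[float(x) for x in row] for row in A]
B64 = [[float(x) for x in row] for row in B]
plain = gemm_fp64(A64, B64)

err_cascade = max(abs(C[i][j] - exact[i][j])
                  for i in range(2) for j in range(2))
err_plain = max(abs(Fraction(plain[i][j]) - exact[i][j])
                for i in range(2) for j in range(2))

print("FP64 GEMMs used:", num_gemms)  # 4 + 3 + 2 + 1 = 10
print("max error, single FP64 GEMM:", float(err_plain))
print("max error, cascaded GEMMs:  ", float(err_cascade))
```

In this sketch all O(n^3) work happens inside `gemm_fp64`, and only the O(n^2) splitting and accumulation touch higher precision, which mirrors the division of labor described under Main Contributions. The omitted products (i + j >= 4) are the source of the approximation.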