Abstract: This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) from (in terms of) lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. With this, it is shown how approximate FP64x2 GEMM accuracy can be cast in terms of ten "cascading" FP64 GEMMs. Promising results from preliminary performance and accuracy experiments are reported. The demonstrated techniques open up new research directions for more general cascading of higher-precision computation in terms of lower-precision computation for GEMM-like functionality.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores how to obtain higher-precision matrix-matrix multiplication (GEMM), for example double-double precision (FP64x2), from lower-precision high-performance GEMM, for example double precision (FP64). Specifically, the authors demonstrate how to approximate high-precision GEMM by decomposing double-double precision matrices into multiple double precision matrices and multiplying those.

### Main Contributions

1. **Strategy**: A method is proposed to decompose FP64x2 matrices into multiple FP64 matrices, allowing FP64x2 GEMM to be approximated through 10 FP64 GEMMs.
2. **Prototype Implementation**: A prototype implementation is developed, demonstrating how the proposed scheme can achieve high performance and high precision in practical applications, while:
   - Using a workspace similar to high-performance FP64 GEMM implementations,
   - Leveraging existing FP64 GEMM kernels for portable high performance,
   - Performing all high-order O(n^3) work using only FP64 arithmetic, with most O(n^2) work also using only FP64 arithmetic.
3. **New Opportunities**: New research directions are discussed, including:
   - How to more generally support high-precision GEMM through low-precision GEMM,
   - Opportunities for leveraging GPUs,
   - The need for in-depth numerical analysis,
   - The benefits of scaling/balancing methods,
   - The potential for hardware support,
   - How to support other level-3 BLAS functions.

### Background

- **Precision Issues**: Single precision (FP32) and double precision (FP64) are typically supported in hardware, but higher precision (e.g., quadruple precision, FP128) is less widely supported because hardware area grows quadratically with the size of the mantissa.
- **Double-Double Precision (FP64x2)**: Extends precision by storing each number as two FP64 numbers, providing more mantissa bits without increasing the exponent range. Although computation still takes longer than in FP64, it is simpler and faster than software-implemented FP128.
- **Early Work**: Earlier research focused on achieving higher-precision GEMM or dot products through low-precision computation, but input matrices were typically stored in low precision.

### Methodology

- **Decomposition**: Decompose each FP64x2 matrix into four FP64 blocks, each representing a part of the original matrix.
- **Computation**: Perform FP64 GEMM on these blocks and accumulate the results at the appropriate precision.
- **Error Analysis**: Give a preliminary analysis of the errors introduced when decomposing FP64x2 GEMM into FP64 GEMMs, particularly the impact of matrix transformations on those errors.

### Conclusion

By decomposing high-precision GEMM into multiple low-precision GEMMs, both high performance and high precision can be achieved. This approach not only advances the understanding of how to implement FP64x2 GEMM but also opens many new directions for future research.
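
The decomposition and computation steps under Methodology can be illustrated with a small Python sketch. It is not the paper's implementation, and this summary does not specify the paper's splitting scheme. The sketch assumes one power-of-two scale per matrix, a chunk width chosen from the inner dimension so that each FP64 dot product of chunks is exact, and exact rational arithmetic (`fractions.Fraction`) standing in for the FP64x2 splitting and accumulation. The names `to_dd`, `split_matrix`, `gemm_fp64`, and `cascading_gemm` are hypothetical. One natural way to arrive at ten GEMMs, assumed here, is to keep only the chunk pairs (i, j) with i + j <= 3 out of the sixteen possible products.

```python
from fractions import Fraction
import math


def to_dd(x):
    """Round an exact value to a double-double pair (hi, lo)."""
    hi = float(x)
    lo = float(Fraction(x) - Fraction(hi))
    return hi, lo


def split_matrix(M, bits):
    """Split a matrix of (hi, lo) pairs into four FP64 matrices.

    Chunk i holds integer multiples of 2**(e - (i + 1) * bits), so every
    chunk entry has at most about `bits` significant bits (assumed scheme).
    """
    exact = [[Fraction(hi) + Fraction(lo) for hi, lo in row] for row in M]
    biggest = max(abs(x) for row in exact for x in row)
    _, e = math.frexp(float(biggest))  # 2**e is at least as large as every entry
    chunks = []
    for i in range(4):
        unit = Fraction(2) ** (e - (i + 1) * bits)
        chunk = [[round(x / unit) * unit for x in row] for row in exact]
        # Keep the remainder for the next, finer chunk.
        exact = [[x - c for x, c in zip(xr, cr)] for xr, cr in zip(exact, chunk)]
        chunks.append([[float(c) for c in row] for row in chunk])
    return chunks


def gemm_fp64(A, B):
    """Plain FP64 triple-loop GEMM, standing in for an optimized FP64 GEMM."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for r in range(n):
        for c in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[r][p] * B[p][c]
            C[r][c] = acc
    return C


def cascading_gemm(A, B):
    """Approximate the FP64x2 product A @ B with ten FP64 GEMMs."""
    k = len(B)
    # Choose the chunk width so that k products of two chunks sum exactly in FP64.
    bits = (53 - (k - 1).bit_length()) // 2
    Ac, Bc = split_matrix(A, bits), split_matrix(B, bits)
    n, m = len(A), len(B[0])
    total = [[Fraction(0)] * m for _ in range(n)]
    pairs = [(i, j) for i in range(4) for j in range(4) if i + j <= 3]
    assert len(pairs) == 10  # the six least significant products are dropped
    for i, j in pairs:
        P = gemm_fp64(Ac[i], Bc[j])
        for r in range(n):
            for c in range(m):
                # Fraction stands in for FP64x2 accumulation in this sketch.
                total[r][c] += Fraction(P[r][c])
    return [[to_dd(x) for x in row] for row in total]


if __name__ == "__main__":
    A = [[to_dd(Fraction(1, 3)), to_dd(Fraction(-2, 7))]]
    B = [[to_dd(Fraction(5, 9))], [to_dd(Fraction(1, 11))]]
    print(cascading_gemm(A, B))
```

All of the O(n^3) work happens inside `gemm_fp64`, which uses only FP64 arithmetic; the higher-precision steps touch each matrix entry a constant number of times. A production version would replace the rational arithmetic with double-double operations and the triple loop with an optimized FP64 GEMM.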

Cascading GEMM: High Precision from Low Precision

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Multigrid Methods using Block Floating Point Arithmetic

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

A Study of Mixed Precision Strategies for GMRES on GPUs

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Optimization of SpGEMM with Risc-V vector instructions

DGEMM on Integer Matrix Multiplication Unit

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Generating Families of Practical Fast Matrix Multiplication Algorithms

Faster arbitrary-precision dot product and matrix multiplication

IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

A coordinated tiling and batching framework for efficient GEMM on GPUs