Abstract: In recent years, the fervent demand for computational power across various domains has prompted hardware manufacturers to introduce specialized computing hardware aimed at enhancing computational capabilities. In particular, tensor hardware supporting low precision has gained increasing prominence in scientific research. However, using low-precision tensor hardware for computational acceleration often introduces errors, which poses the fundamental challenge of achieving effective acceleration while maintaining computational accuracy.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses how to accelerate computation on low-precision tensor hardware while preserving the accuracy of the results. Specifically, the paper identifies the following key problems and solutions:

1. **Error control in low-precision computation**:
   - Performing matrix multiplication in low precision (for example, with 8-bit or 4-bit integers) can significantly improve speed, but it introduces quantization errors that reduce the accuracy of the results.
   - The paper proposes a hybrid-precision quantization method, combined with residual compensation quantization, to reduce these quantization errors.
2. **Application of sparse matrices**:
   - Sparse matrices are introduced to reduce computational cost. By keeping only the values that may have a significant impact on the relative error, the amount of computation drops while the quantization error stays under control.
   - Specifically, sparse matrix multiplication replaces dense matrix multiplication to avoid unnecessary calculations.
3. **Acceleration of low-precision matrix multiplication**:
   - A threshold-based method is proposed to control the amount of computation in low-precision matrix multiplication, so that the calculation stays efficient within an acceptable error range.
   - A high-performance low-precision quantization algorithm is designed, and its effectiveness is verified through a series of experiments.
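The threshold idea in items 2 and 3 can be illustrated with a small, dependency-free Python sketch. It is not the paper's algorithm: the dictionary-based sparse format, the helper names (`sparsify`, `sparse_dense_matmul`, `dense_matmul`), the example matrices, and the threshold value are all assumptions chosen for illustration. The sketch drops entries whose magnitude falls below a threshold and multiplies only the surviving entries, trading a bounded error for fewer multiply-adds.

```python
def sparsify(matrix, threshold):
    """Keep only entries with |value| >= threshold, as {(row, col): value}."""
    return {
        (i, j): v
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if abs(v) >= threshold
    }


def sparse_dense_matmul(sparse, n_rows, dense):
    """Multiply a sparse matrix (dict form, n_rows rows) by a dense matrix."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    # Only stored (non-dropped) entries contribute any work.
    for (i, k), v in sparse.items():
        for j in range(n_cols):
            out[i][j] += v * dense[k][j]
    return out


def dense_matmul(a, b):
    """Reference dense product, used here only to measure the error."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical near-sparse matrix: a few large entries, many small ones.
R = [[0.50, 0.01, 0.00],
     [0.02, 0.00, 0.40],
     [0.00, 0.03, 0.01]]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

R_sparse = sparsify(R, threshold=0.1)   # assumed threshold
approx = sparse_dense_matmul(R_sparse, len(R), B)
exact = dense_matmul(R, B)

kept = len(R_sparse)
max_err = max(
    abs(x - y) for row_a, row_e in zip(approx, exact) for x, y in zip(row_a, row_e)
)
print(f"kept {kept} of 9 entries, max abs error {max_err:.3f}")
```

Raising the threshold keeps fewer entries and increases the error; lowering it does the opposite. How the paper selects its threshold is not stated in this summary.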
### Formula summary

- **Quantization formula**:
  \[ a_{\text{int}} = Q(a_{\text{fp}}, \lambda) = \text{TypeCast}(\lambda \cdot a_{\text{fp}}, \text{int}_N) \]
  \[ \lambda = \frac{2^{N - 1}-1}{a_{\text{max}}} \]
- **De-quantization formula**:
  \[ a_{\text{fp}} = \tilde{Q}(a_{\text{int}}, \lambda) = \text{TypeCast}\left(\frac{a_{\text{int}}}{\lambda}, \text{float}_N\right) \]
- **Low-precision matrix multiplication**:
  \[ M_{\text{fp32}} = A_{\text{fp32}}\cdot B_{\text{fp32}} \approx \frac{A_{\text{int}}\cdot B_{\text{int}}}{\lambda_M} \]
  \[ \lambda_M=\lambda_A\cdot\lambda_B \]
- **Residual compensation matrix multiplication**:
  \[ C_{\text{fp}}=\frac{A_{\text{int}}\cdot B_{\text{int}}}{\lambda_A\cdot\lambda_B}+\frac{A_{\text{int}}\cdot R_{B,\text{int}}}{\lambda_A\cdot\lambda_{R_B}}+\frac{R_{A,\text{int}}\cdot B_{\text{int}}}{\lambda_{R_A}\cdot\lambda_B} \]

With these methods, the paper reports efficient acceleration of low-precision computation while preserving the accuracy of the results.
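The formulas above can be traced end to end with a short, dependency-free Python sketch. Several details are assumptions rather than statements from the paper: `TypeCast` is modeled as round-to-nearest with clamping, the residual is taken as the input minus its de-quantized value (for example, the residual of A is A minus A_int divided by λ_A) and is quantized again with its own scale, and the function names and 4-bit example matrices are hypothetical.

```python
def quantize(matrix, n_bits):
    """Symmetric quantization: returns (integer matrix, scale lambda)."""
    a_max = max(abs(v) for row in matrix for v in row)
    q_max = 2 ** (n_bits - 1) - 1
    lam = q_max / a_max if a_max > 0 else 1.0
    # Assumption: TypeCast is modeled as round-to-nearest, clamped to the int range.
    q = [[max(-q_max, min(q_max, round(lam * v))) for v in row] for row in matrix]
    return q, lam


def dequantize(q, lam):
    """Map integers back to floating point by dividing by the scale."""
    return [[v / lam for v in row] for row in q]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def residual(matrix, q, lam):
    """Assumed residual definition: original minus its de-quantized value."""
    deq = dequantize(q, lam)
    return [[m - d for m, d in zip(rm, rd)] for rm, rd in zip(matrix, deq)]


def scaled(matrix, factor):
    return [[v / factor for v in row] for row in matrix]


def add(*matrices):
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


N_BITS = 4  # low precision chosen so the quantization error is visible
A = [[0.90, -0.35], [0.12, 0.61]]
B = [[0.25, 0.80], [-0.77, 0.43]]

A_int, lam_A = quantize(A, N_BITS)
B_int, lam_B = quantize(B, N_BITS)
RA_int, lam_RA = quantize(residual(A, A_int, lam_A), N_BITS)
RB_int, lam_RB = quantize(residual(B, B_int, lam_B), N_BITS)

# Plain low-precision product: A_int * B_int / (lambda_A * lambda_B)
plain = scaled(matmul(A_int, B_int), lam_A * lam_B)

# Residual-compensated product: the three terms of the last formula
compensated = add(
    plain,
    scaled(matmul(A_int, RB_int), lam_A * lam_RB),
    scaled(matmul(RA_int, B_int), lam_RA * lam_B),
)

exact = matmul(A, B)
err_plain = max_abs_diff(plain, exact)
err_comp = max_abs_diff(compensated, exact)
print(f"plain error {err_plain:.4f}, compensated error {err_comp:.4f}")
```

In this toy case the two correction terms recover most of the error of the plain integer product. The product of the two residuals is omitted, as in the formula above, so the result remains an approximation.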

A method for accelerating low precision operations by sparse matrix multiplication
