Abstract: The Matrix2 accelerator is a high-performance multi-core vector processor for high-density computing. We design an efficient parallel implementation of the Linpack benchmark for the Matrix2. (1) We propose an efficient parallel matrix multiplication algorithm. It selects the optimal block parameters for the innermost sub-block matrix multiplication based on the architectural characteristics of the Matrix2, and it fully exploits multi-level parallelism, including instruction-level, vector-unit-level, and core-level parallelism. We also propose a row-based vectorization method for matrix multiplication, which avoids inefficient column accesses and reduction summations between VPEs and achieves optimal kernel performance. (2) We propose an efficient parallel triangular matrix multiplication algorithm. It distributes the irregular triangular matrix multiplication evenly across the vector processing units and fully leverages the computational capacity of the vector processor. It also supports in-place computation, storing the result matrix in the space of the original multiplier matrix to reduce memory consumption. (3) We propose an efficient parallel method for solving triangular systems of equations. It significantly improves computational efficiency by solving the equations in parallel on multiple cores. (4) We configure the L1D in SRAM mode for finer-grained software memory management. We propose a data transfer strategy based on a two-level DMA double-buffering scheme to optimize and smooth data transmission between the levels of the memory hierarchy. It allows data movement to overlap completely with kernel computation, so that the kernel program always runs at peak speed. The experimental results on the Matrix2 show that the efficiencies of double-precision parallel matrix multiplication, parallel triangular matrix multiplication, and Linpack computation are 96.08%, 91.47%, and 84.58%, respectively.
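To make two of the ideas above concrete, the following is a minimal pure-Python sketch, not the paper's Matrix2 kernel. It illustrates row-based matrix multiplication, where each output row is built from scalar-times-row updates so that no column of B is walked and no cross-lane reduction is needed, and an in-place lower-triangular multiply. The function names `matmul_row_based` and `trmm_lower_inplace` are hypothetical, and the choice of overwriting B with L·B is an assumption, since the abstract does not state which operand layout the authors use.

```python
def matmul_row_based(A, B):
    """Return C = A @ B using row-oriented updates.

    The innermost loop runs along a row of B and a row of C, which is
    the loop a vector unit would map onto its lanes. A[i][p] acts as a
    broadcast scalar, so no column access or lane reduction occurs.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        for p in range(k):
            a = A[i][p]          # scalar broadcast to all lanes
            b_row = B[p]         # contiguous row access
            for j in range(n):   # vectorizable: independent per j
                c_row[j] += a * b_row[j]
    return C


def trmm_lower_inplace(L, B):
    """Overwrite B with L @ B, where L is square lower triangular.

    Assumption for illustration: the result replaces B. Row i of the
    result needs only rows 0..i of the original B, so rows are processed
    bottom-up; every row read has not yet been overwritten.
    """
    m, n = len(B), len(B[0])
    for i in range(m - 1, -1, -1):
        row = B[i]
        d = L[i][i]
        for j in range(n):       # scale the diagonal contribution first
            row[j] *= d
        for p in range(i):       # add contributions from rows above
            a = L[i][p]
            src = B[p]
            for j in range(n):
                row[j] += a * src[j]
    return B


if __name__ == "__main__":
    print(matmul_row_based([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(trmm_lower_inplace([[2, 0], [1, 3]], [[1.0, 2.0], [3.0, 4.0]]))
```

In the triangular routine, row i costs i + 1 row updates, so the work per row is uneven. This is the irregularity that a real implementation would have to balance across vector processing units; the sketch makes no attempt to do so.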

FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform

NVIDIA Tensor Core Programmability, Performance & Precision

Acceleration of Tensor-Product Operations with Tensor Cores

Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

Improving Performance of Matrix Multiplication and FFT on GPU

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

An Open-Source Framework for Efficient Numerically-Tailored Computations

MFFT: A GPU Accelerated Highly Efficient Mixed-precision Large-scale FFT Framework

SAM: A Scalable Accelerator for Number Theoretic Transform Using Multi-Dimensional Decomposition

A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU+ Fast Matrix Multiply

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

Performant low-order matrix-free finite element kernels on GPU architectures

Providing performance portable numerics for Intel GPUs

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

High Performance Matrix Multiplication on General Purpose Graphics Processing Units