Abstract: The Matrix2 accelerator is a high-performance multi-core vector processor for high-density computing. We design an efficient parallel implementation of the Linpack benchmark for the Matrix2. (1) We propose an efficient parallel matrix multiplication algorithm. It chooses the optimal block parameters for the innermost sub-block matrix multiplication based on the architectural characteristics of the Matrix2, and it fully exploits multi-level parallelism, including instruction-level, vector-unit-level, and core-level parallelism. We also propose a vectorization method for matrix multiplication based on row computation, which avoids inefficient column accesses and reduction summations between VPEs and achieves optimal kernel performance. (2) We propose an efficient parallel triangular matrix multiplication algorithm. It distributes the irregular triangular matrix multiplication evenly across the vector processing units and fully leverages the computational capacity of the vector processor. It also supports in-place computation, storing the result matrix in the space of the original multiplier matrix to reduce memory consumption. (3) We propose an efficient parallel method for solving triangular systems of equations, which significantly improves computational efficiency by solving the equations in parallel on multiple cores. (4) We configure the L1D in SRAM mode for finer-grained software memory management, and we propose a data transfer strategy based on a two-level DMA double-buffering scheme to optimize and smooth data transmission between levels of the memory hierarchy. The scheme lets data movement overlap completely with kernel computation, so the kernel program always runs at peak speed. Experimental results on the Matrix2 show that the efficiencies of double-precision parallel matrix multiplication, parallel triangular matrix multiplication, and Linpack computation are 96.08%, 91.47%, and 84.58%, respectively.
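As a rough illustration of items (1) and (2), the sketch below shows one plausible reading of the row-based formulation and of the in-place triangular product. It is not the Matrix2 kernel: the function names `matmul_row_form` and `trmm_lower_inplace` are hypothetical, plain Python lists stand in for vector registers, and the choice of a lower-triangular left operand is an assumption, since the abstract does not say which side or which triangle is used.

```python
def matmul_row_form(A, B):
    """Return C = A * B using row updates C[i,:] += A[i][k] * B[k,:].

    Each step multiplies one scalar of A by one contiguous row of B, so a
    vector unit would never read a column of B and every lane accumulates
    its own element of C (no reduction across lanes).
    """
    m, p, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        for k in range(p):
            a_ik = A[i][k]        # scalar that would be broadcast to all lanes
            b_row = B[k]          # contiguous row of B
            for j in range(n):    # the loop a vector unit would execute
                c_row[j] += a_ik * b_row[j]
    return C


def trmm_lower_inplace(L, B):
    """Overwrite B with L * B, where L is lower triangular (assumed layout).

    Row i of the result needs only rows 0..i of the original B, so walking
    i from the last row to the first never reads a row already overwritten.
    Entries of L above the diagonal are ignored.
    """
    n = len(L)
    width = len(B[0])
    for i in range(n - 1, -1, -1):
        row = B[i]
        diag = L[i][i]
        for j in range(width):
            row[j] *= diag            # diagonal term, reusing B's own storage
        for k in range(i):
            l_ik = L[i][k]
            src = B[k]                # still holds original values (k < i)
            for j in range(width):
                row[j] += l_ik * src[j]
    return B
```

In this form the row index `i` carries no dependence in `matmul_row_form`, so rows could be distributed across cores, while the triangular product does less work for small `i` than for large `i`, which is the kind of irregularity a balanced distribution across vector units would have to even out.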

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator
