Abstract: In recent years, advances in computer hardware have driven rapid growth in the supercomputer industry, and supercomputer architectures have evolved from traditional multi-core designs to many-core and heterogeneous many-core designs. The Sunway Many-Core Platform series, which has completely independent intellectual property rights, is representative of China's heterogeneous many-core supercomputing processors. SpMV (sparse matrix-vector multiplication) is a core computing kernel in scientific and engineering computing, and its performance often has a large impact on overall application performance. This article analyzes the master-slave acceleration architecture of the SW26010p Many-Core Platform processor and the implementation of SpMV for sparse matrices stored in the CSR format on that platform. Because the memory of each SW26010p slave core is limited, it cannot hold the vector data needed for large-scale SpMV, which leads to long memory access times and reduced performance. To solve this problem and optimize SpMV performance, this paper studies SpMV optimization strategies for the SW26010p Many-Core Platform. First, we propose a method that assigns tasks by the number of rows in which the non-zero elements are located, to solve the load balancing problem among slave cores. Second, we propose an adaptive memory allocation algorithm for LDM to make optimal use of LDM memory. Third, based on a refined division of the LDM space, we propose several algorithms to improve the hit rate of vector x, including dynamic and static double cache algorithms based on LRU and LRU-k for the slave-core architecture, and a dynamic and static cache eviction algorithm based on ARC for the slave-core architecture. Together, these measures optimize SpMV performance by reducing communication time and improving the ratio of computation to memory access. Finally, several representative sparse matrices are selected from the Matrix Market collection and tested, and the performance of the algorithms is analyzed. The results show that, compared with the traditional method, our scheme greatly improves both the overall hit ratio for x and the master-slave speedup: the maximum speedup exceeds 20 times and the average speedup reaches 10.5 times. The optimization methods adopted in this paper can also serve as a reference for other complex applications on the SW26010p.
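The three ingredients named in the abstract (SpMV over CSR storage, load balancing by non-zero count, and caching of vector x) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `csr_spmv`, `partition_rows_by_nnz`, and `lru_hit_rate` are hypothetical, the partitioning rule is a simple prefix-sum split, and the cache is a plain LRU over individual x entries rather than the LDM-based double cache, LRU-k, or ARC schemes the paper proposes.

```python
from collections import OrderedDict


def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Non-zeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def partition_rows_by_nnz(row_ptr, n_workers):
    """Split rows into contiguous chunks with roughly equal non-zero counts.

    Assumption: each worker stands in for one slave core. A chunk boundary
    is placed at the first row whose non-zero prefix sum reaches the target.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for w in range(1, n_workers):
        target = total_nnz * w / n_workers
        r = bounds[-1]
        while r < n_rows and row_ptr[r] < target:
            r += 1
        bounds.append(r)
    bounds.append(n_rows)
    return [(bounds[w], bounds[w + 1]) for w in range(n_workers)]


def lru_hit_rate(col_idx, capacity):
    """Replay the x accesses of an SpMV through an LRU cache of x entries.

    Assumption: the cache holds `capacity` single elements of x, a
    simplified stand-in for a slave core's limited local memory.
    """
    cache = OrderedDict()
    hits = 0
    for j in col_idx:
        if j in cache:
            hits += 1
            cache.move_to_end(j)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[j] = True
    return hits / len(col_idx) if col_idx else 0.0


# Small 4x4 example matrix in CSR form (illustrative data only).
row_ptr = [0, 2, 3, 6, 8]
col_idx = [0, 2, 1, 0, 2, 3, 1, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 2.0, 3.0, 4.0]

y = csr_spmv(row_ptr, col_idx, values, x)
chunks = partition_rows_by_nnz(row_ptr, 2)
small_cache_rate = lru_hit_rate(col_idx, 2)
large_cache_rate = lru_hit_rate(col_idx, 4)
```

Comparing `small_cache_rate` with `large_cache_rate` shows how the hit rate for x depends on how much local memory is reserved for it, which is the trade-off an adaptive memory allocation scheme has to balance.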

Efficiently Running SpMV on Multi-core DSPs for Banded Matrix

Optimizing SpMV on Heterogeneous Multi-Core DSPs Through Improved Locality and Vectorization

A Cross-Platform SpMV Framework on Many-Core Architectures.

Tpspmv: A Two-Phase Large-Scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures

Towards Efficient SpMV on Sunway Manycore Architectures.

Implementation and optimization of SpMV algorithm based on SW26010P many-core processor and stored in BCSR format

A sparse matrix vector multiplication accelerator based on high-bandwidth memory

Towards Large-Scale Sparse Matrix-Vector Multiplication on the SW26010 Manycore Architecture.

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication.

Automatic Tuning of Sparse Matrix-Vector Multiplication on Multicore Clusters.

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

Performance Optimization for Parallel SpMV on a NUMA Architecture

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Efficient Algorithm Design of Optimizing SpMV on GPU.

B-Sct: Improve Spmv Processing On Simd Architectures

Research on SpMV Implementation and Vector X Hit Rate Optimization for SW26010p Many-Core Platform

FPGA and GPU Implementation of Large Scale SpMV

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Performance Optimization for SpMV on Multi-GPU Systems Using Threads and Multiple Streams

Optimizing General Matrix Multiplications on Modern Multi-core DSPs