Abstract: The sparse triangular solver (SpTRSV) is one of the most essential kernels in many scientific and engineering applications. Parallelizing SpTRSV efficiently on modern many-core architectures is difficult because of the inherent data dependencies of the computation and its discontinuous memory accesses. Achieving high SpTRSV performance is even more challenging on SW26010, the new-generation customized heterogeneous many-core processor used in the top-ranked Sunway TaihuLight supercomputer. Owing to their regular sparsity pattern, structured-grid triangular problems have computing characteristics quite different from those of general problems and offer new opportunities for algorithm design on many-core architectures, which have so far received little attention. In this work, we focus on how to design and implement a fast SpTRSV for structured-grid problems on SW26010. We propose a generalized algorithm framework for parallel SpTRSV, based on a fine-grained Producer-Consumer model, that makes the best use of the features and flexibility of the SW26010 many-core architecture. Moreover, we present a novel parallel structured-grid SpTRSV that uses direct data transfers across the registers of the computing elements of SW26010. Experiments on four typical structured-grid triangular problems with different problem sizes demonstrate that our SpTRSV achieves an average memory bandwidth utilization of 79.7%, measured against the stream benchmark, which leads to a speedup of 17.7 over the serial version on SW26010. Furthermore, experiments with real-world sparse linear problems show that our proposed SpTRSV achieves superior preconditioning performance compared with the Intel Xeon E5-2670 v3 CPU and the Intel Xeon Phi 7210 KNL with DDR4 memory.
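
To make the dependency problem concrete, the following is a minimal Python sketch of a lower-triangular solve on a 2D structured grid. It is an illustration only, not the algorithm of the paper: the two-coefficient stencil (each unknown depends on its west and south neighbors), the function names, and the array layout are all assumptions chosen for brevity. The second function uses the classic wavefront (diagonal) ordering, a standard technique swapped in here to show where parallelism can come from; the Producer-Consumer framework and register communication described in the abstract are not modeled.

```python
def sptrsv_serial(diag, west, south, b):
    """Solve L x = b by a lexicographic sweep over an nx-by-ny grid.

    Assumed (hypothetical) stencil: row (i, j) of L has a diagonal entry
    diag[i][j], a coefficient west[i][j] on x[i-1][j], and a coefficient
    south[i][j] on x[i][j-1].
    """
    nx, ny = len(b), len(b[0])
    x = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            acc = b[i][j]
            if i > 0:
                acc -= west[i][j] * x[i - 1][j]   # depends on an earlier row
            if j > 0:
                acc -= south[i][j] * x[i][j - 1]  # depends on an earlier column
            x[i][j] = acc / diag[i][j]
    return x


def sptrsv_wavefront(diag, west, south, b):
    """Same solve, visiting points by anti-diagonal k = i + j.

    All points on one anti-diagonal depend only on the previous one, so
    the inner loop could be executed in parallel; here it runs serially.
    """
    nx, ny = len(b), len(b[0])
    x = [[0.0] * ny for _ in range(nx)]
    for k in range(nx + ny - 1):
        for i in range(max(0, k - ny + 1), min(nx, k + 1)):
            j = k - i
            acc = b[i][j]
            if i > 0:
                acc -= west[i][j] * x[i - 1][j]
            if j > 0:
                acc -= south[i][j] * x[i][j - 1]
            x[i][j] = acc / diag[i][j]
    return x
```

Both functions perform the same arithmetic per grid point, so they return identical results; only the traversal order differs. The wavefront version exposes at most min(nx, ny) independent points at a time, which hints at why finer-grained schemes are of interest on many-core hardware.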

Characterize and Optimize Dense Linear Solver on Multi-core CPUs

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System

Parallel Sparse LU Factorization With Machine-Learning Method on Multi-core Processors

Highly Efficient Parallel Direct Solver for Solving Dense Complex Matrix Equations from Method of Moments

Multicore-Based Performance Optimization For Dense Matrix Computation

Optimizing Algorithm of Sparse Linear Systems on GPU

Performance Modeling and Optimization of Parallel LU-SGS on Many-Core Processors for 3D High-Order CFD Simulations

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

Parallel Tridiagonal Solver on Sunway Many-Core Processors

Implementation of a Parallel Sparse Direct Solver on Vector Architecture

Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

Multicore-Based Performance Optimization For Evaluating The Inverse Of Sparse Matrix

Implementation and Optimization of Dense LU Decomposition on the Stream Processor

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Accelerating the iterative linear solver for reservoir simulation on multicore architectures

Research on the Optimization of BLAS Level 1 and 2 Functions on Shenwei Many-Core Processor

A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore CPU

Toward Efficient Structured-Grid Triangular Solver on Sunway Many-Core Processors