Abstract: The Matrix2 Accelerator is a high-performance multi-core vector processor for high-density computing. We design an efficient parallel implementation of the Linpack benchmark for the Matrix2. (1) We propose an efficient parallel matrix multiplication algorithm. It chooses the optimal block parameters for the innermost sub-block matrix multiplication based on the architectural characteristics of the Matrix2, and it fully exploits multi-level parallelism, including instruction-level, vector-unit-level, and core-level parallelism. We also propose a vectorization method for matrix multiplication based on row computation, which avoids inefficient column accesses and reduction summations between VPEs and can obtain optimal kernel performance. (2) We propose an efficient parallel triangular matrix multiplication algorithm. It evenly distributes the irregular triangular matrix multiplication across the vector processing units and fully leverages the computational capacity of the vector processor. It also supports in-place computation, storing the result matrix in the space of the original multiplier matrix to reduce memory consumption. (3) We propose an efficient parallel method for solving triangular systems of equations. It significantly improves computational efficiency by solving the equations in parallel on multiple cores. (4) We configure the L1D in SRAM mode for finer-grained software memory management. We propose a data transfer strategy based on a two-level DMA double-buffering scheme to optimize and smooth data transmission between levels of the memory hierarchy. It allows data movement to overlap completely with kernel computation, so that the kernel can always run at peak speed. The experimental results on the Matrix2 show that the efficiencies of double-precision parallel matrix multiplication, parallel triangular matrix multiplication, and Linpack computation are 96.08%, 91.47%, and 84.58%, respectively.
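
The row-based vectorization in item (1) can be illustrated with a minimal sketch. This is not the authors' Matrix2 kernel: the function name `matmul_row_oriented` is hypothetical, plain Python lists stand in for vector registers, and the innermost loop stands in for the VPE lanes. It only shows why a row-oriented update needs neither column accesses to B nor a reduction across lanes.

```python
def matmul_row_oriented(a, b):
    """Compute C = A * B one output row at a time.

    Each step adds a scalar multiple of a row of B to a row of C,
    so B is only ever read along rows and every lane owns one
    output element (no cross-lane reduction is required).
    """
    m, k = len(a), len(b)
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]                 # accumulator row, one element per lane
        for p in range(k):
            a_ip = a[i][p]           # scalar that would be broadcast to all lanes
            row_b = b[p]             # contiguous row of B
            for j in range(n):       # stands in for the vector lanes (assumption)
                row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_row_oriented(a, b))
```

By contrast, a dot-product formulation computes each element of C from a row of A and a column of B, which requires strided column reads and a final sum across lanes; the row-oriented form sidesteps both.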

Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

Online Scheduling on a CPU-GPU Cluster

Online Scheduling of Mixed CPU-GPU Jobs

Experience of Parallelizing Cryo-EM 3D Reconstruction on a CPU-GPU Heterogeneous System

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System

GPU First -- Execution of Legacy CPU Codes on GPUs

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

High Performance FFT Based Poisson Solver on a CPU-GPU Heterogeneous Platform

A New Hybrid GPU-CPU Sparse LDLT Factorization Algorithm with GPU and CPU Factorizing Concurrently

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Towards a Heterogeneous Architecture Solver for the Incompressible Navier–Stokes Equations

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

High Performance Computing Via a GPU

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

A Scalable Hybrid Algorithm for Solving Partial Differential Equations on a Cluster of CPU/GPU

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

623 Tflop/s HPCG Run on Tianhe-2: Leveraging Millions of Hybrid Cores