Abstract: The Matrix2 accelerator is a high-performance multi-core vector processor for high-density computing. We design an efficient parallel implementation of the Linpack benchmark for the Matrix2. (1) We propose an efficient parallel matrix multiplication algorithm. It selects the optimal block parameters for the innermost sub-block matrix multiplication based on the architectural characteristics of the Matrix2, and it fully exploits multi-level parallelism, including instruction-level, vector-unit-level, and core-level parallelism. We also propose a vectorization method for matrix multiplication based on row-wise computation, which avoids inefficient column accesses and reduction summations across VPEs and achieves optimal kernel performance. (2) We propose an efficient parallel triangular matrix multiplication algorithm. It distributes the irregular triangular matrix multiplication evenly across the vector processing units and fully leverages the computational capacity of the vector processor. It also supports in-place computation, storing the result matrix in the space of the original multiplier matrix to reduce memory consumption. (3) We propose an efficient parallel method for solving triangular systems of equations, which significantly improves computational efficiency by solving the equations in parallel on multiple cores. (4) We configure the L1D as SRAM for finer-grained software memory management, and we propose a data transfer strategy based on a two-level DMA double-buffering scheme to optimize and smooth data transmission between the levels of the memory hierarchy. It allows data movement to overlap completely with kernel computation, so that the kernel always runs at peak speed. Experimental results on the Matrix2 show that the efficiencies of double-precision parallel matrix multiplication, parallel triangular matrix multiplication, and Linpack are 96.08%, 91.47%, and 84.58%, respectively.
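
The row-wise vectorization idea in item (1) can be illustrated with a minimal sketch. This is not the authors' Matrix2 kernel: the function name `matmul_row_axpy` is hypothetical, and the mapping of the innermost loop to vector lanes (VPEs) is an assumption made for illustration only. The point it shows is that each element of a row of A is broadcast and multiplied by a contiguous row of B, so B is never read by column and no cross-lane reduction is needed.

```python
def matmul_row_axpy(A, B):
    """Compute C = A * B one output row at a time (illustrative sketch).

    Each step updates a whole row of C with a scalar from A times a
    contiguous row of B. On a vector processor the innermost loop would
    map to vector lanes (an assumption here), so every lane owns one
    column of C and no reduction across lanes is required.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        for p in range(k):
            a_ip = A[i][p]      # scalar, conceptually broadcast to all lanes
            b_row = B[p]        # contiguous row of B, no column access
            for j in range(n):  # lane-parallel multiply-accumulate
                c_row[j] += a_ip * b_row[j]
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_row_axpy(A, B))
```

By contrast, the textbook inner-product form computes each element of C as a dot product of a row of A with a column of B, which requires strided column reads and a final sum across lanes when vectorized.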

Automatic Mapping and Code Optimization for OpenCL Kernels on FT-matrix Architecture (WIP Paper)

FT-Matrix: A Coordination-Aware Architecture for Signal Processing

Automatic Mapping Single-Device OpenCL Program to Heterogeneous Multi-device Platform.

OpenFFT: an Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs.

A study of vectorization for matrix-free finite element methods

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

Auto-Tuning Of Thread Assignment For Matrix-Vector Multiplication On Gpus

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

Improving Performance Portability for GPU-specific OpenCL Kernels on Multi-Core/many-core CPUs by Analysis-Based Transformations

Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster.

AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

Hexagonal Tiling Based Multiple FPGAs Stencil Computation Acceleration and Optimization Methodology.

Experience of Optimizing FFT on Intel Architectures

Characterize and Optimize Dense Linear Solver on Multi-core CPUs

Mapping Parallelism in a Functional IR through Constraint Satisfaction

Hardware-Software Co-Design of Matrix-Solving for Non-Linear Optimization in SLAM Systems

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs

Automatic Optimization Heuristics Method for OpenCL Program Based on Graph Neural Network

Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures.

A Performance Analysis Framework For Optimizing Opencl Applications On Fpgas