Abstract: The Matrix2 accelerator is a high-performance multi-core vector processor for high-density computing. We design an efficient parallel implementation of the Linpack benchmark for the Matrix2. (1) We propose an efficient parallel matrix multiplication algorithm. It chooses the optimal block parameters for the innermost sub-block matrix multiplication based on the architectural characteristics of the Matrix2, and it exploits multiple levels of parallelism: instruction-level, vector-unit-level, and core-level. We also propose a vectorization method for matrix multiplication based on row computation, which avoids inefficient column accesses and reduction summations between VPEs and achieves optimal kernel performance. (2) We propose an efficient parallel triangular matrix multiplication algorithm. It distributes the irregular triangular matrix multiplication evenly across the vector processing units and fully leverages the computational capacity of the vector processor. It also supports in-place computation, storing the result matrix in the space of the original multiplier matrix to reduce memory consumption. (3) We propose an efficient parallel method for solving triangular systems of equations. It significantly improves computational efficiency by solving the equations in parallel on multiple cores. (4) We configure the L1D in SRAM mode for finer-grained software memory management, and we propose a data transfer strategy based on a two-level DMA double-buffering scheme to optimize and smooth data transmission between levels of the memory hierarchy. This lets data movement overlap completely with kernel computation, so the kernel can always run at peak speed. Experimental results on the Matrix2 show that the efficiencies of double-precision parallel matrix multiplication, parallel triangular matrix multiplication, and Linpack computation are 96.08%, 91.47%, and 84.58%, respectively.
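
The row-based vectorization in (1) and the in-place computation in (2) can be illustrated with a minimal sketch. It assumes plain Python lists in place of the Matrix2 vector registers, and the function names `matmul_row_oriented` and `trmm_lower_in_place` are hypothetical, not the paper's kernels. The innermost loop over `j` stands in for a single vector multiply-add. Each step broadcasts one scalar of A against one contiguous row of B, so no column of B is read and no sum across lanes is needed. The triangular routine assumes a lower-triangular left operand and walks the rows bottom-up, so every row it still needs holds original data when it is read.

```python
def matmul_row_oriented(A, B):
    """Return C = A * B using row updates C[i, :] += A[i][k] * B[k, :]."""
    m, inner, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        for k in range(inner):
            a_ik = A[i][k]      # scalar, conceptually broadcast to every lane
            b_row = B[k]        # contiguous row of B, no column access
            for j in range(n):  # stands in for one vector multiply-add
                c_row[j] += a_ik * b_row[j]
    return C


def trmm_lower_in_place(L, B):
    """Overwrite B with L * B, where L is lower triangular (assumed layout)."""
    rows, cols = len(L), len(B[0])
    # Row i of the result needs only rows 0..i of the original B, so going
    # bottom-up guarantees those rows have not been overwritten yet.
    for i in range(rows - 1, -1, -1):
        b_i = B[i]
        l_ii = L[i][i]
        for j in range(cols):
            b_i[j] *= l_ii
        for k in range(i):
            l_ik = L[i][k]
            b_k = B[k]
            for j in range(cols):
                b_i[j] += l_ik * b_k[j]
    return B
```

On real hardware the `j` loop would be blocked to the vector length and the rows of B staged through on-chip memory. This sketch shows only the loop ordering and the data dependences.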

Improving the DRAM Access Efficiency for Matrix Multiplication on Multicore Accelerators

A design framework for processing-in-memory accelerator

Improving System Performance in Heterogeneous MPSoC Systems via Dynamic DRAM Bandwidth Allocation

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Direct Distributed Memory Access for CMPs

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture

A Study of Leveraging Memory Level Parallelism for DRAM System on Multi-core/Many-Core Architecture

Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Fault-Tolerant Masked Matrix Accumulation using Bulk Bitwise In-Memory Engines

Non-aligned memory access acceleration method based on inter-register communication

KUMMS: Optimising DRAM Locality with Kernel-user Behaviours

Accelerating Data Movement on Future Chip Multi-Processors

Efficient Distributed Memory Management with RDMA and Caching

High Performance Matrix Multiplication on Many Cores

A Performance Evaluation of DRAM Access for In-Memory Databases

Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

A Processor-DMA-Based Memory Copy Hardware Accelerator

MCS-DMA: An optimization design of memory controller for DMA transfers in SoC

High-Performance and Energy-Effcient Memory Scheduler Design for Heterogeneous Systems

Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller