Abstract: Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component of many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the high latency of computing high-dimensional, large-scale sparse matrix multiplication on GPUs hinders the development of these applications. Collaborative computing across heterogeneous cores is an effective approach, but it must address three challenges: (1) irregularly distributed non-zero elements lead to load imbalance and irregular memory access, (2) differences in computing latency between cores reduce computational parallelism, and (3) temporary data transfer between different cores introduces additional latency overhead. In this work, we propose ApSpGEMM, a framework for collaborative large-scale sparse matrix multiplication on CPU-GPU heterogeneous cores. ApSpGEMM builds on sparsity rules and uses reordering and splitting algorithms to eliminate the impact of the non-zero element distribution on load balance and memory access. Adaptive panel allocation with affinity constraints among cores then improves computational parallelism. Finally, carefully arranged asynchronous data transmission and computation balance the communication overhead. Compared with state-of-the-art SpGEMM methods, our approach provides excellent absolute performance on matrices with different sparse structures. On heterogeneous cores, the GFlops of large-scale sparse matrix multiplication improves by a factor of 2.25 to 7.21.
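To make the reordering and splitting idea concrete, the following is a minimal sketch, not the ApSpGEMM algorithm itself. It assumes CSR storage, a row-wise (Gustavson-style) SpGEMM, and a simple greedy rule that packs rows into panels of bounded estimated work. The function names (`to_csr`, `row_flops`, `split_panels`, `spgemm_rows`) and the `budget` parameter are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def row_flops(a, b):
    """Estimate work per output row: for each nonzero A[i, k], add nnz(B[k, :])."""
    a_ptr, a_idx, _ = a
    b_ptr, _, _ = b
    return [
        sum(b_ptr[k + 1] - b_ptr[k] for k in a_idx[a_ptr[i]:a_ptr[i + 1]])
        for i in range(len(a_ptr) - 1)
    ]


def split_panels(flops, budget):
    """Reorder rows by descending work, then pack them greedily into panels.

    A panel is closed as soon as adding the next row would exceed `budget`
    (an assumed tuning parameter, not a value from the paper).
    """
    order = sorted(range(len(flops)), key=lambda i: -flops[i])
    panels, current, load = [], [], 0
    for i in order:
        if current and load + flops[i] > budget:
            panels.append(current)
            current, load = [], 0
        current.append(i)
        load += flops[i]
    if current:
        panels.append(current)
    return panels


def spgemm_rows(a, b, rows):
    """Row-wise SpGEMM restricted to `rows`; returns {row: {col: value}}."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    out = {}
    for i in rows:
        acc = defaultdict(float)  # sparse accumulator for one output row
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, av = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                acc[b_idx[q]] += av * b_val[q]
        out[i] = dict(acc)
    return out
```

In a heterogeneous setting, each panel returned by `split_panels` could be handed to a different device and processed independently with `spgemm_rows`, since output rows do not depend on one another. How panels are actually sized and assigned to CPU or GPU cores is specific to the paper and is not reproduced here.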

Optimizing General Matrix Multiplications on Modern Multi-core DSPs

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Optimizing Stencil Computation on Multi-core DSPs

GEMM Optimization for a Decoupled Access/Execute Architecture Processor

Optimizing sparse general matrix–matrix multiplication for DCUs

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Performance Analysis and Optimizations of Matrix Multiplications on ARMv8 Processors

Exploring the Architecture of Multiple GEMM Accelerators in Heterogeneous Systems

NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

Efficiently Running SpMV on Multi-core DSPs for Banded Matrix

Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs

Optimizing Multi-grid Computation and Parallelization on Multi-cores

High Performance Matrix Multiplication on Many Cores

DGEMM on Integer Matrix Multiplication Unit

ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel

An Efficient Method of Parallel Multiplication on a Single DSP Slice for Embedded FPGAs

Improving Performance of Matrix Multiplication and FFT on GPU