Abstract: In recent years, graph neural networks (GNNs) have achieved impressive performance in various application fields by extracting information from graph-structured data. GNNs contain extensive feature aggregation operations, which can be abstracted as a specialized sparse-dense matrix multiplication (SpMM) operation and have become a performance bottleneck. Previous works have leveraged the inner product or outer product to accelerate the feature aggregation process. However, inefficient execution leads to extremely unbalanced workloads and extensive intermediate data, hampering the performance of previous processors. In this article, we demonstrate an algorithm/hardware co-optimization opportunity to enhance SpMM acceleration for GNNs. First, the algorithm part develops a dataflow-efficient SpMM algorithm that integrates three optimization methods to mitigate computation and memory access inefficiencies. Specifically, 1) the proposed equal-value partition method achieves fine-grained data partition and enables load balancing during data movement; 2) after observing the vertex aggregation phenomenon, a vertex-clustering optimization method is presented to enable significant data locality; and 3) the adaptive dataflow based on Gustavson's algorithm is further implemented to enable the efficient distribution of sparse elements and improve computing resource utilization. Then, the hardware part implements the proposed SpMM algorithm in SDMA, a flexible and efficient customized accelerator that follows the adaptive dataflow to eliminate sparsity and explore the regular parallelism dimension. Finally, we prototype SDMA on the Xilinx Alveo U280 FPGA accelerator card. The results demonstrate that SDMA achieves $5.68\times$ to $14.68\times$ higher energy efficiency than previous GPU implementations deployed on the Nvidia GTX 1080Ti and $1.32\times$ higher throughput than the state-of-the-art FPGA prototype.
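For readers unfamiliar with the row-wise formulation, the following is a minimal software sketch of Gustavson-style SpMM applied to feature aggregation. It is not the SDMA dataflow or the authors' algorithm: the function name `spmm_gustavson`, the CSR argument names, and the tiny example graph are all assumptions chosen for illustration. Each nonzero of a sparse row scales one dense feature row and accumulates it into the output row, so no inner-product index matching or outer-product partial matrices are needed.

```python
def spmm_gustavson(indptr, indices, data, dense):
    """Row-wise (Gustavson-style) product C = A @ B.

    A is sparse in CSR form (indptr, indices, data); B is a dense
    list of rows. All names here are illustrative assumptions.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0]) if dense else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_out = out[i]
        # Visit only the nonzeros of row i of A.
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]      # column index = neighbor vertex
            a = data[k]         # edge weight
            b_row = dense[j]    # feature row of that neighbor
            # Scale the neighbor's features and accumulate into row i.
            for c in range(n_cols):
                row_out[c] += a * b_row[c]
    return out


# Hypothetical 3-vertex graph: vertex 0 aggregates from 1 and 2,
# vertex 1 from 2, vertex 2 from 0 (all edge weights 1.0).
indptr = [0, 2, 3, 4]
indices = [1, 2, 2, 0]
data = [1.0, 1.0, 1.0, 1.0]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

aggregated = spmm_gustavson(indptr, indices, data, features)
print(aggregated)
```

In this sketch, the work per output row is proportional to that row's nonzero count, which is why skewed vertex degrees translate directly into the unbalanced workloads the abstract mentions.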

FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson’s Algorithm on FPGAs

FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication

FPGA and GPU Implementation of Large Scale SpMV

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

SpArch: Efficient Architecture for Sparse Matrix Multiplication

GUST: Graph Edge-Coloring Utilization for Accelerating Sparse Matrix Vector Multiplication

Accelerating Unstructured SpGEMM using Structured In-situ Computing

FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-art to Future Opportunities

Optimization of SpGEMM with Risc-V vector instructions

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs

SMASH: Sparse Matrix Atomic Scratchpad Hashing

Fast and Practical Strassen's Matrix Multiplication using FPGAs

A sparse matrix vector multiplication accelerator based on high-bandwidth memory

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

FSHMEM: Supporting Partitioned Global Address Space on FPGAs for Large-Scale Hardware Acceleration Infrastructure

SAGE: A Storage-Based Approach for Scalable and Efficient Sparse Generalized Matrix-Matrix Multiplication

fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms