Abstract: In recent years, graph neural networks (GNNs) have achieved impressive performance in various application fields by extracting information from graph-structured data. GNNs contain extensive feature aggregation operations, which can be abstracted as a specialized sparse-dense matrix multiplication (SpMM) operation and have become a performance bottleneck. Previous works have leveraged the inner product or outer product to accelerate the feature aggregation process. However, inefficient execution leads to extremely unbalanced workloads and extensive intermediate data, hampering the performance of previous processors. In this article, we demonstrate an algorithm/hardware co-optimization opportunity to enhance SpMM acceleration for GNNs. First, the algorithm part develops a dataflow-efficient SpMM algorithm that integrates three optimization methods to mitigate computation and memory access inefficiencies. Specifically, 1) the proposed equal-value partition method achieves fine-grained data partitioning and enables load balancing during data movement; 2) based on the observed vertex aggregation phenomenon, a vertex-clustering optimization method is presented to improve data locality; and 3) an adaptive dataflow based on Gustavson's algorithm is implemented to distribute sparse elements efficiently and improve computing resource utilization. Then, the hardware part builds on the proposed SpMM algorithm and customizes SDMA, a flexible and efficient accelerator for SpMM, which follows the adaptive dataflow to eliminate sparsity and exploit the regular parallelism dimension. Finally, we prototype SDMA on the Xilinx Alveo U280 FPGA accelerator card. The results demonstrate that SDMA achieves $5.68\times$–$14.68\times$ higher energy efficiency than previous GPU implementations deployed on the Nvidia GTX 1080Ti and $1.32\times$ higher throughput than the state-of-the-art FPGA prototype.
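
As a minimal illustration of the row-wise (Gustavson-style) SpMM that the abstract refers to, the sketch below aggregates vertex features with a sparse adjacency matrix stored in CSR form. It is not the SDMA dataflow: the function name `spmm_row_wise`, the CSR array names, and the toy graph are assumptions made for illustration only, and the partitioning and clustering optimizations are not modeled.

```python
def spmm_row_wise(indptr, indices, data, dense):
    """Compute C = A @ B, with A given in CSR form and B as a list of dense rows.

    Row-wise (Gustavson-style) order: every nonzero A[i, k] scales row k of B
    and accumulates the result into row i of C.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0]) if dense else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_c = out[i]
        # Visit only the stored nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]      # column index of the nonzero
            a = data[p]         # value of the nonzero
            row_b = dense[k]
            for j in range(n_cols):
                row_c[j] += a * row_b[j]
    return out


# Hypothetical 3-vertex graph: vertex 0 has neighbors {1, 2},
# vertex 1 has neighbor {0}, vertex 2 has neighbors {0, 1}.
indptr = [0, 2, 3, 5]
indices = [1, 2, 0, 0, 1]
data = [1.0, 1.0, 1.0, 1.0, 1.0]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

aggregated = spmm_row_wise(indptr, indices, data, features)
print(aggregated)
```

In this ordering each output row is completed before the next one starts, so no partial-sum matrices need to be merged afterwards, which is the property usually contrasted with outer-product schemes. The work per row is proportional to that row's nonzero count, which is why skewed vertex degrees translate directly into unbalanced workloads.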

Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core

Joint Sparsity with Mixed Granularity for Efficient GPU Implementation

Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks

GE-SpMM: General-Purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

High Performance Unstructured SpMM Computation Using Tensor Cores

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

Efficient sparse-matrix multi-vector product on GPUs

Accelerating Sparse Approximate Matrix Multiplication on GPUs

BafSP: Co-Design of Compute SRAM and Bit-Aware Data Flip Mitigation with In-Memory Sparsity Detection for SpMM

Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration

FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs

A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs