FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson’s Algorithm on FPGAs

Erfan Bank Tavakoli, Michael Riera, Masudul Hassan Quraishi, Fengbo Ren
DOI: https://doi.org/10.1109/tvlsi.2024.3355499
2024-01-01
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract: General sparse matrix–matrix multiplication (SpGEMM) is integral to many high-performance computing (HPC) and machine learning applications. However, prior field-programmable gate array (FPGA)-based SpGEMM accelerators either use the inner product algorithm, which incurs wasted and costly operations, or use Gustavson’s algorithm with a cache-based hardware architecture that suffers from long-latency cache miss penalties and is limited to embedded devices. In this work, we propose FSpGEMM, an OpenCL-based framework for accelerating SpGEMM with Gustavson’s algorithm. It includes an FPGA kernel implementing a throughput-optimized and scalable hardware architecture compatible with high-bandwidth memory (HBM) or traditional DDR-based memory. In addition, to address the irregular memory access patterns incurred by Gustavson’s algorithm, we propose a new buffering scheme tailored to the algorithm, enabled by a new compressed sparse vector (CSV) format for representing sparse matrices, and a row reordering technique applied as a preprocessing step to improve data reuse and, consequently, resource utilization. The framework also includes a host program implementing preprocessing functions that reorder the input matrices and store them in the proposed CSV format for further use. We implemented FSpGEMM using the Intel FPGA SDK for OpenCL and evaluated it with a benchmark of sparse matrices selected from the SuiteSparse Matrix Collection on a Bittware 520N-MX FPGA board. The results show that the reordering technique improves performance by 20.3% on average compared with the baseline. Finally, FSpGEMM outperforms the state-of-the-art (SOTA) FPGA implementation by an average of $2.23\times$ in terms of execution cycles, using the same benchmark and memory system configuration for a fair comparison.
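For readers unfamiliar with Gustavson’s algorithm, the following is a minimal software sketch of the row-wise formulation the abstract refers to. It assumes standard CSR inputs (not the paper’s CSV format) and a dictionary as the per-row accumulator; the function name `spgemm_gustavson` and its argument names are illustrative and do not come from the paper, and the sketch says nothing about the FPGA architecture itself.

```python
def spgemm_gustavson(a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    """Row-wise (Gustavson-style) product C = A @ B, all matrices in CSR form.

    Assumption: plain CSR arrays are used here for illustration only;
    the paper stores matrices in its own CSV format.
    """
    c_indptr, c_indices, c_data = [0], [], []
    n_rows = len(a_indptr) - 1
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C: column -> partial sum
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k, a_ik = a_indices[p], a_data[p]
            # Only stored entries A[i, k] are visited; each one scales
            # row k of B and merges it into the accumulator.
            for q in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_data[q]
        # Emit row i of C with column indices in ascending order.
        for j in sorted(acc):
            c_indices.append(j)
            c_data.append(acc[j])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_data
```

Unlike an inner-product formulation, which pairs every row of A with every column of B, this loop touches only stored entries of A and the rows of B they select. Which rows of B are read depends on the column indices in A, which is the source of the irregular memory access the abstract mentions.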
engineering, electrical & electronic; computer science, hardware & architecture