Abstract: Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is the multiplication of randomly distributed nonzero elements in a sparse matrix by their corresponding vector elements. Due to the irregular access patterns of vector elements and limited memory bandwidth, the computational throughput of CPUs and GPUs is lower than the peak performance offered by FPGAs. The large on-chip memory of an FPGA allows the input vector to be buffered on-chip, so the off-chip memory bandwidth is used only to transfer the values, column indices, and row indices of the nonzero elements. In each cycle, multiple nonzero elements are transmitted to the FPGA and their corresponding vector elements are then accessed. However, typical on-chip block RAMs (BRAMs) in FPGAs have only two access ports. This mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that a vector element can be reused by all nonzero elements with the same column index, so the memory requests for those elements can be omitted by reusing the fetched data. Based on this observation, a novel compressed format is proposed that optimizes data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx Zynq UltraScale+ ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida Sparse Matrix Collection. The experimental results show that the proposed design achieves an average 1.22x speedup over the state-of-the-art work.
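The reuse observation can be illustrated with a minimal Python sketch. It is not the compressed format or the accelerator proposed in the paper: it assumes a plain COO representation, a hypothetical batch width `lanes` standing in for the number of nonzeros delivered per cycle, and it models each distinct column index within a batch as one on-chip vector read. The function names are illustrative only.

```python
def spmv_coo(rows, cols, vals, x, n_rows):
    """Reference SpMV over COO triples: y[r] += v * x[c]."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def count_vector_fetches(cols, lanes):
    """Count modeled vector reads when nonzeros arrive `lanes` at a time.

    Assumption: nonzeros in the same batch that share a column index
    reuse one fetched vector element, so each batch costs one read per
    distinct column index it contains.
    """
    return sum(len(set(cols[i:i + lanes])) for i in range(0, len(cols), lanes))


def reorder_by_column(rows, cols, vals):
    """Reorder COO triples so nonzeros with equal column index are adjacent."""
    order = sorted(range(len(cols)), key=lambda i: (cols[i], rows[i]))
    return (
        [rows[i] for i in order],
        [cols[i] for i in order],
        [vals[i] for i in order],
    )


# A small 4x4 example stored in row-major COO order (hypothetical data).
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 2, 0, 3, 0, 2, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 2.0, 3.0, 4.0]

r2, c2, v2 = reorder_by_column(rows, cols, vals)

fetches_before = count_vector_fetches(cols, lanes=4)
fetches_after = count_vector_fetches(c2, lanes=4)

# Reordering changes only the order of accumulation, not the result.
y_before = spmv_coo(rows, cols, vals, x, n_rows=4)
y_after = spmv_coo(r2, c2, v2, x, n_rows=4)
```

In this toy model, grouping nonzeros by column index lowers the number of distinct vector reads per batch while leaving the product unchanged, which is the effect the abstract attributes to reordering. A real design would also have to deal with accumulating into rows that now arrive out of order, which this sketch ignores.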

Optimized Data Reuse via Reordering for Sparse Matrix-Vector Multiplication on FPGAs
