Abstract: In recent years, graph neural networks (GNNs) have achieved impressive performance in various application fields by extracting information from graph-structured data. GNNs contain extensive feature aggregation operations, which can be abstracted as a specialized sparse-dense matrix multiplication (SpMM) operation and have become a performance bottleneck. Previous works have leveraged the inner product or outer product to accelerate the feature aggregation process. However, inefficient execution leads to extremely unbalanced workloads and extensive intermediate data, hampering the performance of previous processors. In this article, we demonstrate an algorithm/hardware co-optimization opportunity to enhance SpMM acceleration for GNNs. First, the algorithm part develops a dataflow-efficient SpMM algorithm that integrates three optimization methods to mitigate computation and memory access inefficiencies. Specifically, 1) the proposed equal-value partition method achieves fine-grained data partition and enables load balancing during data movement; 2) after observing the vertex aggregation phenomenon, a vertex-clustering optimization method is presented to enable significant data locality; and 3) the adaptive dataflow based on Gustavson’s algorithm is further implemented to enable the efficient distribution of sparse elements and improve computing resource utilization. Then, the hardware part incorporates the proposed SpMM algorithm and customizes SDMA, a flexible and efficient accelerator for SpMM, which follows the adaptive dataflow to eliminate sparsity and explore the regular parallelism dimension. Finally, we prototype SDMA on the Xilinx Alveo U280 FPGA accelerator card. The results demonstrate that SDMA achieves $5.68\times$ to $14.68\times$ higher energy efficiency than previous GPU implementations deployed on the Nvidia GTX 1080Ti and $1.32\times$ higher throughput than the state-of-the-art FPGA prototype.
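
For readers unfamiliar with the row-wise formulation, the following is a minimal sketch of feature aggregation written as SpMM in the style of Gustavson's algorithm. It is not the SDMA dataflow described in the abstract: the function name `spmm_row_wise`, the CSR array names, and the toy graph are assumptions chosen for illustration, and the partitioning and clustering optimizations are not modeled.

```python
def spmm_row_wise(indptr, indices, data, dense):
    """Multiply a CSR sparse matrix by a dense matrix, one output row at a time.

    Row-wise (Gustavson-style) product: each nonzero A[i, j] scales row j of the
    dense matrix and accumulates it into output row i, so only nonzeros are visited.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0]) if dense else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        acc = out[i]
        # Nonzeros of sparse row i live in positions indptr[i] .. indptr[i + 1] - 1.
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]   # column index = neighbor vertex
            a = data[k]      # edge weight
            row = dense[j]   # feature vector of the neighbor
            for c in range(n_cols):
                acc[c] += a * row[c]
    return out


# Hypothetical toy graph with 3 vertices; the adjacency matrix is stored in CSR form.
# Row 0 has neighbors {1, 2}, row 1 has neighbor {2}, row 2 has neighbor {0}.
indptr = [0, 2, 3, 4]
indices = [1, 2, 2, 0]
data = [1.0, 1.0, 1.0, 1.0]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

aggregated = spmm_row_wise(indptr, indices, data, features)
print(aggregated)
```

In this formulation the work per output row is proportional to the number of nonzeros in that row, which is why skewed vertex degrees translate directly into unbalanced workloads.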

BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System

Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism

Automatic Tuning of Sparse Matrix-Vector Multiplication on Multicore Clusters

High Performance Optimizations for Nuclear Physics Code MFDn on KNL

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

MPI+X: Massive Parallelization and Dynamic Load Balance of a Production-level Unstructured DSMC Solver

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Network States-Aware Collective Communication Optimization

NestedMP: Enabling Cache-Aware Thread Mapping for Nested Parallel Shared Memory Applications

NUMA-Aware Shared-Memory Collective Communication for MPI

Node Aware Sparse Matrix-Vector Multiplication

Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi

A Hierarchical Grid Algorithm for Accelerating High-Performance Conjugate Gradient Benchmark on Sunway Many-Core Processor

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application

Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs

Study on MPI/OpenMP Hybrid Parallelism for Monte Carlo Neutron Transport Code

On Optimizing the Communication of Model Parallelism

ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-based Near-Memory Processing with Inter-DIMM Broadcast

A More Scalable Sparse Dynamic Data Exchange

Enhancing Scalability and Performance in Influence Maximization with Optimized Parallel Processing