HP-GNN: Generating High Throughput GNN Training Implementation on CPU-FPGA Heterogeneous Platform

Yi-Chien Lin, Bingyi Zhang, Viktor Prasanna
DOI: https://doi.org/10.1145/3490422.3502359
2021-12-22
Abstract: Graph Neural Networks (GNNs) have shown great success in many applications such as recommendation systems, molecular property prediction, and traffic prediction. Recently, CPU-FPGA heterogeneous platforms have been used to accelerate many applications by exploiting the customizable data path and abundant user-controllable on-chip memory resources of FPGAs. Yet, accelerating and deploying GNN training on such platforms requires not only expertise in hardware design but also substantial development effort. We propose HP-GNN, a novel framework that generates high-throughput GNN training implementations on a given CPU-FPGA platform and that can benefit both application developers and machine learning researchers. HP-GNN takes GNN training algorithms and GNN models as inputs, and automatically performs hardware mapping onto the target CPU-FPGA platform. HP-GNN consists of: (1) a data layout and internal representation that reduce memory traffic and random memory accesses; (2) optimized hardware templates that support various GNN models; (3) a design space exploration engine for automatic hardware mapping; (4) high-level application programming interfaces (APIs) that allow users to specify GNN training with only a handful of lines of code. To evaluate HP-GNN, we experiment with two well-known sampling-based GNN training algorithms and two GNN models. For each training algorithm and model, HP-GNN generates an implementation on a state-of-the-art CPU-FPGA platform. Compared with CPU-only and CPU-GPU platforms, experimental results show that the generated implementations achieve $55.67\times$ and $2.17\times$ speedup on average, respectively. Compared with the state-of-the-art GNN training implementations, HP-GNN achieves up to $4.45\times$ speedup.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses how to generate high-throughput implementations of graph neural network (GNN) training on CPU-FPGA heterogeneous platforms. It proposes a framework named HP-GNN that automatically maps GNN training algorithms and models onto a target CPU-FPGA platform, thereby improving the efficiency and performance of GNN training. Although GNNs have achieved remarkable success in fields such as recommendation systems, molecular property prediction, and traffic prediction, training them on large-scale graph data is still constrained by memory bandwidth and high computational complexity. In addition, existing acceleration methods usually require hardware design expertise and substantial development effort, which limits their adoption. HP-GNN addresses these problems with four components:

1. **Data Layout and Internal Representation**: Organizes data storage and access patterns to reduce memory traffic and random memory accesses.
2. **Optimized Hardware Templates**: Support various GNN models.
3. **Design Space Exploration Engine**: Automatically performs hardware mapping for the given GNN training algorithm and model.
4. **High-Level Application Programming Interfaces (APIs)**: Let users specify GNN training in only a handful of lines of code, without in-depth knowledge of hardware details.

Together, these components raise GNN training throughput and lower the development barrier, so that more application developers and machine learning researchers can use CPU-FPGA platforms for GNN training.
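
The idea behind a design space exploration engine can be illustrated with a small sketch. The Python below is not HP-GNN's engine or its API. It is a toy example that assumes a hypothetical two-stage pipeline (aggregation and update), an invented DSP budget, and invented per-unit costs. It enumerates candidate configurations, discards those that exceed the budget, and keeps the one with the lowest modeled time per batch.

```python
from itertools import product

# All numbers below are illustrative placeholders, not values from the paper.
DSP_BUDGET = 1000          # hypothetical number of DSP slices available
DSP_PER_AGG_UNIT = 4       # hypothetical cost of one aggregation unit
DSP_PER_UPDATE_UNIT = 10   # hypothetical cost of one update unit


def batch_time(agg_units, update_units, agg_work, update_work):
    """Toy model: the stages run as a pipeline, so the slower one sets the pace."""
    return max(agg_work / agg_units, update_work / update_units)


def explore(agg_work, update_work, choices=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Exhaustively search configurations that fit the budget.

    Returns (time, dsp_used, agg_units, update_units) for the fastest
    configuration, preferring lower DSP usage on ties, or None if no
    configuration fits.
    """
    best = None
    for agg_units, update_units in product(choices, repeat=2):
        dsp = agg_units * DSP_PER_AGG_UNIT + update_units * DSP_PER_UPDATE_UNIT
        if dsp > DSP_BUDGET:
            continue  # configuration does not fit on the (hypothetical) device
        t = batch_time(agg_units, update_units, agg_work, update_work)
        candidate = (t, dsp, agg_units, update_units)
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    result = explore(agg_work=1_000_000, update_work=4_000_000)
    print(result)
```

Exhaustive search is used here only because the toy space is tiny. A real engine would rely on a calibrated performance and resource model of the target platform rather than these placeholder constants.
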
Experimental results show that, compared with CPU-only and CPU-GPU platforms, the implementations generated by HP-GNN achieve average speedups of $55.67\times$ and $2.17\times$, respectively. Compared with state-of-the-art GNN training implementations, HP-GNN achieves up to $4.45\times$ speedup.