Abstract: Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch size, FPGAs are expected to achieve further performance improvement. However, the performance gain of a single-FPGA design is limited by the scarce on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs with the objective of achieving super-linear speedup over a single-FPGA design. In implementing such systems, we found two barriers to this goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) the insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which can support different kinds of DNNs. In this paper, we take Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology that alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48x speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
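
To make the term "super-linear" concrete, the short sketch below checks the reported figure (3.48x on 2 FPGAs) against the usual definition: speedup divided by device count greater than 1. The function names are hypothetical and chosen for illustration; they are not part of Super-LIP.

```python
# Illustrative sketch only: these helpers are hypothetical, not from Super-LIP.

def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Return speedup per device; a value above 1.0 means super-linear scaling."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return speedup / num_devices


def is_super_linear(speedup: float, num_devices: int) -> bool:
    """True when the speedup exceeds the number of devices used."""
    return parallel_efficiency(speedup, num_devices) > 1.0


# Figure reported in the abstract: 3.48x speedup with 2 FPGAs.
efficiency = parallel_efficiency(3.48, 2)
print(f"efficiency per FPGA: {efficiency:.2f}")      # 1.74
print(f"super-linear: {is_super_linear(3.48, 2)}")   # True
```

An efficiency of 1.74 per FPGA means that each board contributes more than it would in isolation, which is consistent with the abstract's claim that moving traffic from the memory bus to inter-FPGA links relieves the bandwidth bottleneck.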

Multi-FPGA Implementation of Feedforward Network and Its Performance Analysis

A Multi-Level-Optimization Framework for FPGA-Based Cellular Neural Network Implementation

The implementation of feedforward network on field programmable gate array

Spike Trains Encoding Optimization for Spiking Neural Networks Implementation in FPGA

A Multi-FPGA Embedded System for the Emulation of Modular Small-World Network with Real Time Dynamics

A Case for Spiking Neural Network Simulation Based on Configurable Multiple-FPGA Systems

FPGA Implementation of Feed-Forward MLP Neural Networks

Hardware implementation of spiking neural networks on FPGA

Mapping Very Large Scale Spiking Neuron Network to Neuromorphic Hardware

Highly efficient neuromorphic learning system of spiking neural network with multi-compartment leaky integrate-and-fire neurons

FPGA Application in Neural System Model Simulation

XFER: A Novel Design to Achieve Super-Linear Performance on Multiple FPGAs for Real-Time AI

FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks with Efficient DSP and Memory Optimization

Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

An FPGA Implementation of Deep Spiking Neural Networks for Low-Power and Fast Classification

A Low-Cost and High-Speed Hardware Implementation of Spiking Neural Network

Digital Implementation of the Spiking Neural Network and Its Digit Recognition

A Low Power and Low Latency FPGA-Based Spiking Neural Network Accelerator

Multi-clusters: an Efficient Design Paradigm of NN Accelerator Architecture Based on FPGA

Multi-core ARM-based Hardware-Accelerated Computation for Spiking Neural Networks

Neural Network with Cascaded Model Dendritic Morphologic and FPGA Implementation