Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators

Hans Johnson, Tianyang Fang, Alejandro Perez-Vicente, Jafar Saniie
2023-05-25
Abstract: We propose a distributed system based on low-power embedded FPGAs for edge computing applications. The system is designed to explore distributed scheduling optimizations for Deep Learning (DL) workloads in order to obtain the best performance in terms of latency and power efficiency. The cluster was modular throughout the experiment: our implementations consist of up to 12 Zynq-7020 chip-based boards as well as 5 UltraScale+ MPSoC FPGA boards connected through an Ethernet switch, and the cluster is used to evaluate a configurable Deep Learning Accelerator (DLA), the Versatile Tensor Accelerator (VTA). This adaptable distributed architecture is distinguished by its capacity to evaluate and manage neural network workloads in numerous configurations, which enables users to conduct multiple experiments tailored to their specific application needs. The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the computation graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
Distributed, Parallel, and Cluster Computing; Hardware Architecture; Machine Learning; Systems and Control
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Mismatch between hardware and deep learning architectures**: Although deep learning (DL) frameworks have made progress in exploring new deep learning architectures, the development of electronic design automation (EDA) tools has lagged behind. As a result, hardware description language (HDL) designs are difficult to modify and consume excessive logic resources.
2. **Challenges in supporting new operations**: Neural network (NN) computation graphs are becoming increasingly complex, while the processing units (PUs) in application-specific integrated circuits (ASICs) are usually fixed. Deep learning compilers must therefore support new computations on existing hardware, which is a complex and time-consuming task.
3. **Increased demand for efficient deep learning computation in edge computing**: To improve power efficiency, reduce latency, and optimize scheduling in edge computing applications, it is necessary to optimize neural network architectures and allocate specialized hardware to the appropriate computation modules.

The paper proposes a distributed system based on low-power embedded FPGAs that explores distributed scheduling optimizations for edge computing applications, with the goal of achieving the best performance in terms of latency and power efficiency. The system is modular and can evaluate neural network workloads under various configurations, allowing users to conduct customized experiments based on specific application needs. Additionally, the system can execute different neural network models simultaneously, arrange computation graphs into pipeline structures, and manually allocate more resources to computation-intensive layers.
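
To make the idea of arranging a computation graph as a pipeline more concrete, here is a minimal Python sketch. It is not the paper's scheduler: the function name `partition_pipeline`, the per-layer cost values, and the assumption that the network is a simple chain of layers are all hypothetical and chosen only for illustration. The sketch splits the chain into contiguous stages, one per board, so that the slowest stage (the pipeline bottleneck) is as cheap as possible. A side effect is that a heavy layer tends to end up alone on its own board.

```python
from itertools import accumulate


def partition_pipeline(layer_costs, num_boards):
    """Split a chain of layers into contiguous pipeline stages.

    layer_costs: hypothetical per-layer cost estimates (e.g. latency units).
    num_boards:  number of boards available; one stage runs per board.
    Returns (stages, bottleneck), where stages is a list of layer-index lists
    and bottleneck is the cost of the slowest stage.
    """
    n = len(layer_costs)
    k = min(num_boards, n)  # never more stages than layers
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")

    # best[s][i]: smallest bottleneck when the first i layers use s stages.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0

    for s in range(1, k + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                # The last stage holds layers j..i-1.
                cost = max(best[s - 1][j], prefix[i] - prefix[j])
                if cost < best[s][i]:
                    best[s][i] = cost
                    cut[s][i] = j

    # Walk the recorded cut points backwards to recover the stages.
    stages = []
    i = n
    for s in range(k, 0, -1):
        j = cut[s][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return stages, best[k][n]


if __name__ == "__main__":
    # Hypothetical costs: layer 3 is the most computationally intensive.
    stages, bottleneck = partition_pipeline([4, 1, 1, 6, 2, 2], num_boards=3)
    for board, layers in enumerate(stages):
        print(f"board {board}: layers {layers}")
    print("bottleneck cost:", bottleneck)
```

In a real deployment the costs would come from profiling each layer on the target accelerator, and the stage-to-board mapping could then be adjusted by hand, in line with the manual allocation the paper describes.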