Design optimization for high-performance computing using FPGA

Murat Isik, Kayode Inadagbo, Hakan Aktas
2023-04-25
Abstract: Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations in several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and the difficulty of optimizing performance. In this paper, we optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet20 trained on CIFAR in order to gain insight into the use of FPGAs for high-performance computing. We show how improving the hardware design, using Xilinx Ultra RAM, and using advanced compiler strategies can lead to improved inference performance. We also demonstrate that running the CIFAR test data set shows very little accuracy drop when rounding down from the original 32-bit floating point. The heterogeneous computing model in our platform allows us to achieve a frame rate of 293.58 frames per second (FPS) and 90% accuracy on a ResNet20 trained using CIFAR. The experimental results show that the proposed accelerator achieves a throughput of 21.12 Giga-Operations Per Second (GOP/s) with 5.21 W of on-chip power consumption at 100 MHz. Comparisons with off-the-shelf devices and recent state-of-the-art implementations show that the proposed accelerator has clear advantages in terms of energy efficiency.
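To make the precision reduction mentioned in the abstract concrete, here is a minimal sketch of converting 32-bit floating-point values to a narrower fixed-point format. The abstract does not state the target format; the 16-bit width, the 8 fractional bits, the round-to-nearest behavior, and all function names are assumptions chosen for illustration, not the accelerator's documented data path.

```python
def to_fixed_point(x, total_bits=16, frac_bits=8):
    """Convert a float to a signed fixed-point integer code.

    Assumed format (hypothetical): total_bits wide, frac_bits fractional bits.
    Values outside the representable range are saturated, not wrapped.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(x * scale)  # round to nearest integer code
    return max(q_min, min(q_max, q))


def from_fixed_point(q, frac_bits=8):
    """Convert a fixed-point integer code back to a float."""
    return q / (1 << frac_bits)


def quantize(x, total_bits=16, frac_bits=8):
    """Round-trip a float through the assumed fixed-point format."""
    return from_fixed_point(to_fixed_point(x, total_bits, frac_bits), frac_bits)


if __name__ == "__main__":
    for value in (0.1, -0.75, 3.14159, 1000.0):
        print(value, "->", quantize(value))
```

With these assumed parameters, in-range values are reproduced to within half a step (1/512), and out-of-range values such as 1000.0 saturate at the largest representable value. A small per-value error of this kind is consistent with the small accuracy drop the abstract reports, though the sketch itself does not measure accuracy.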
Hardware Architecture, Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance optimization of high-performance computing systems based on FPGAs (Field-Programmable Gate Arrays). Specifically, the authors improve the performance of the Tensil AI open-source inference accelerator on an FPGA by optimizing the hardware design, using Xilinx Ultra RAM, and adopting advanced compiler strategies. The paper concentrates on a ResNet20 model trained on the CIFAR-10 dataset and aims to explore the potential of FPGAs in high-performance computing.

### Specific problems solved in the paper include:

1. **Programming complexity and difficulty in performance optimization**: Although FPGAs offer advantages in flexibility, performance, and energy efficiency, they are complex to program and difficult to optimize. The paper improves inference performance on the FPGA by reducing network parameters and computational requirements through methods such as pruning and quantization.
2. **Memory access efficiency**: Ultra RAM is used to optimize memory access and utilization. Storing the neural network weights in Ultra RAM in particular gives fast, efficient access and further improves performance.
3. **Dual-clock scheme**: Introducing a second clock domain (333 MHz) and widening the Tensil AXI ports increases the data transfer speed, reduces the time spent on internal data movement and computation, and thereby raises the overall frame rate.

### Experimental results

- In the baseline design, the system achieved an average frame rate of 133.54 frames per second with an accuracy of 90%.
- In the dual-clock solution, the frame rate increases to 152.04 frames per second, roughly a 14% improvement over the baseline.
- With Ultra RAM, memory access efficiency is further optimized and data processing during inference is faster.

In general, through this series of optimizations, the paper addresses the performance bottlenecks that FPGAs face in high-performance computing applications and shows the potential of FPGAs for machine-learning inference tasks.
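The reported figures can be related to one another with some simple arithmetic. The sketch below uses only numbers quoted in the abstract and in the results above; the function and constant names are hypothetical. It also assumes that the 5.21 W power figure and the 21.12 GOP/s throughput describe the same final configuration that reaches 293.58 FPS, which the text does not state explicitly.

```python
# Figures quoted in the abstract and in the experimental results.
BASELINE_FPS = 133.54      # baseline design
DUAL_CLOCK_FPS = 152.04    # after adding the second clock domain
FINAL_FPS = 293.58         # frame rate reported in the abstract
THROUGHPUT_GOPS = 21.12    # giga-operations per second
POWER_W = 5.21             # on-chip power in watts


def speedup(new_fps, old_fps):
    """Ratio of two frame rates; values above 1.0 mean the new design is faster."""
    return new_fps / old_fps


def energy_efficiency(gops, watts):
    """Throughput per unit power, in GOP/s per watt."""
    return gops / watts


def energy_per_frame_mj(watts, fps):
    """Energy spent on one frame, in millijoules (assumes constant power)."""
    return watts / fps * 1000.0


if __name__ == "__main__":
    print(f"dual clock vs baseline: {speedup(DUAL_CLOCK_FPS, BASELINE_FPS):.2f}x")
    print(f"final vs baseline:      {speedup(FINAL_FPS, BASELINE_FPS):.2f}x")
    print(f"energy efficiency:      {energy_efficiency(THROUGHPUT_GOPS, POWER_W):.2f} GOP/s/W")
    print(f"energy per frame:       {energy_per_frame_mj(POWER_W, FINAL_FPS):.2f} mJ")
```

Under these assumptions, the dual-clock change alone accounts for only a small part of the gap between the baseline and the final frame rate; the rest would come from the remaining optimizations the paper describes.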