Abstract: Deep Convolutional Neural Networks (CNNs) have become a Swiss Army knife for solving critical artificial intelligence tasks. However, deploying deep CNN models for latency-critical tasks remains challenging because of the computational complexity of CNNs. Recently, FPGAs have become favorable devices for accelerating deep CNNs thanks to their high parallel processing capability and energy efficiency. In this work, we explore different fast convolution algorithms, including Winograd and Fast Fourier Transform (FFT), and find an optimal strategy for applying them together on different types of convolutions. We also propose an optimization scheme to exploit parallelism in novel CNN architectures such as the Inception modules in GoogLeNet. We implement a configurable IP-based face recognition acceleration system based on FaceNet using High-Level Synthesis. Our implementation on a Xilinx Ultrascale device achieves a 3.75x latency speedup compared to a high-end NVIDIA GPU and significantly surpasses previous FPGA results.

What problem does this paper attempt to address?

This paper addresses how to implement deep convolutional neural networks (CNNs) efficiently on FPGAs in order to accelerate face recognition, especially in tasks with low-latency requirements. Specifically, the paper focuses on the following aspects:

1. **Optimizing Convolution Algorithms**: The paper explores different fast convolution algorithms, such as the Winograd minimal filtering algorithm and an algorithm based on the fast Fourier transform (FFT), and proposes a hybrid strategy that selects the most suitable algorithm for each convolution type. This helps achieve the best performance across convolution kernels of different sizes.
2. **Resource Allocation and Parallel Processing**: For the multi-branch structure found in modern CNN architectures (such as the Inception module), the paper proposes a new resource allocation scheme that fully utilizes the parallel processing capabilities of FPGAs and reduces overall system latency. This scheme takes into account the parallelism within the module and optimizes performance by dynamically adjusting resource allocation.
3. **Hardware Implementation and Optimization**: The paper uses high-level synthesis (HLS) tools to implement a face recognition system based on Inception V2 and deploys it on Xilinx Ultrascale FPGAs. A series of hardware optimization techniques, such as data quantization, resource reuse, and parallel processing, improves the performance and energy efficiency of the system.
4. **Performance Evaluation**: The paper conducts a detailed performance evaluation of the proposed system, including comparisons with high-end GPUs and other FPGA implementations. The results show that the system is significantly superior to existing solutions in terms of latency and energy efficiency.

In summary, the main objective of this paper is to improve the execution efficiency of deep CNNs on FPGAs through algorithm optimization and hardware design, especially in real-time applications such as face recognition.
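To make the Winograd minimal filtering idea mentioned above more concrete, here is a minimal Python sketch of the standard one-dimensional F(2,3) case, which produces two outputs of a 3-tap filter using four multiplications instead of six. It is an illustration of the general technique only, not the paper's HLS implementation; the function names `winograd_f23` and `direct_f23` are hypothetical and chosen for this example.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over a 4-element tile.

    d: input tile of 4 values, g: filter of 3 values.
    Uses 4 element-wise multiplications (m1..m4) instead of 6.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform (can be precomputed once per filter).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform followed by the four multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3

    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs computed directly (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 0.0, -1.0]
    print(winograd_f23(tile, taps))
    print(direct_f23(tile, taps))
```

The filter transform depends only on the weights, so in an accelerator it would typically be computed once offline, leaving the input transform, the multiplications, and the output transform for the runtime datapath. How the paper maps these steps to FPGA resources is not described in this summary.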

Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs
