Abstract: In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although many optimization methods exist to speed up CNN-based detection algorithms, it is still difficult to deploy them on real-time, low-power systems. Field-Programmable Gate Arrays (FPGAs) have been widely explored as a platform for accelerating CNNs because of their promising performance, high energy efficiency, and flexibility. Previous work shows that the energy consumption of CNN accelerators is dominated by memory access. Fusing multiple CNN layers reduces the intermediate data transfer between them. However, previous accelerators with cross-layer scheduling are each designed for one particular CNN model. In addition to memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve hardware flexibility, we design an instruction-driven CNN accelerator for object detection that supports both the Winograd algorithm and cross-layer scheduling. We modify the loop unrolling order of the CNN so that a network can be scheduled across different layers with instructions, eliminating the intermediate data transfer. We propose a hardware architecture that supports these instructions with Winograd computation units and reaches state-of-the-art energy efficiency. To deploy image detection algorithms on the proposed accelerator, which uses fixed-point computation units, we adopt a fixed-point fine-tuning method that preserves the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. With the cross-layer strategy, the intermediate data transfer is reduced by more than 90% on the VGG-D CNN model. As a result, our hardware accelerator reaches 1700 GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× higher energy efficiency than GPU and CPU, respectively. Compared with the floating-point algorithms, the accuracy of the fixed-point detection algorithms drops by less than 1%.
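To illustrate why the Winograd algorithm reduces the arithmetic cost of convolution, the following is a minimal sketch of the standard one-dimensional F(2,3) case: two outputs of a 3-tap filter are computed with 4 data multiplications instead of 6. It is not the accelerator's implementation; the function names `direct_conv_1d` and `winograd_f2_3` and the sample values are assumptions chosen for illustration.

```python
def direct_conv_1d(d, g):
    """Reference: two outputs of a 3-tap filter over 4 inputs (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f2_3(d, g):
    """Winograd F(2,3): the same two outputs with 4 data multiplications."""
    # Filter transform: depends only on the weights, so it can be
    # precomputed once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform (additions/subtractions only), followed by the
    # four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform: additions/subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]   # hypothetical input tile
    g = [0.5, -1.0, 2.0]       # hypothetical filter taps
    print(direct_conv_1d(d, g))
    print(winograd_f2_3(d, g))
```

Both functions return the same pair of outputs. The two-dimensional variant used for 3×3 convolutions nests this transform along rows and columns, which is where the larger savings in multiplications come from.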

Efficient Depthwise Separable Convolution Accelerator for Classification and UAV Object Detection

A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator

Real-time Semantic Segmentation with Weighted Factorized-Depthwise Convolution

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Hardware Implementation of Depthwise Separable Convolution Neural Network

Exploration for Efficient Depthwise Separable Convolution Networks Deployment on FPGA

An Efficient FPGA Accelerator for Point Cloud

SkyNet: A Champion Model for DAC-SDC on Low Power Object Detection

DSA-CNN: An FPGA-Integrated Deformable Systolic Array for Convolutional Neural Network Acceleration

A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs

DeepDive: An Integrative Algorithm/Architecture Co-Design for Deep Separable Convolutional Neural Networks

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

A Lightweight Detection Method for Remote Sensing Images and Its Energy-Efficient Accelerator on Edge Devices

Algorithm-Hardware Co-Optimization for Energy-Efficient Drone Detection on Resource-Constrained FPGA

A Streaming Accelerator for Deep Convolutional Neural Networks with Image and Feature Decomposition for Resource-limited System Applications

SAS-SEINet: A SNR-Aware Adaptive Scalable SEI Neural Network Accelerator Using Algorithm–Hardware Co-Design for High-Accuracy and Power-Efficient UAV Surveillance

Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

SCCMDet: Adaptive Sparse Convolutional Networks Based on Class Maps for Real-Time Onboard Detection in Unmanned Aerial Vehicle Remote Sensing Images

An Energy-Efficient, Unified CNN Accelerator for Real-Time Multi-Object Semantic Segmentation for Autonomous Vehicle

An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network