Abstract: The design of convolutional neural network (CNN) hardware accelerators based on a single computing engine (CE) architecture or a multi-CE architecture has received widespread attention in recent years. Although this kind of accelerator offers flexible deployment across hardware platforms and a short development cycle, it remains limited in resource utilization and data throughput. When processing large feature maps, it can usually reach only about 10 frames/s, which does not meet the requirements of application scenarios such as autonomous driving and radar detection. To solve these problems, this article proposes a pixel-based, fully pipelined hardware accelerator design. With a pixel-by-pixel strategy, the concept of the layer is downplayed, and the way each pixel of the output feature map (Ofmap) is generated can be optimized. To pipeline the entire computing system, we expand each layer of the neural network into hardware, eliminating the buffers between layers and keeping the entire network connected end to end. This approach yields excellent performance. In addition, because the pixel data stream is a fundamental paradigm in image processing, our fully pipelined hardware accelerator applies to a variety of CNNs in computer vision (MobileNetV1, MobileNetV2, and FashionNet). As an example, the accelerator for MobileNetV1 achieves a speed of 4205.50 frames/s and a throughput of 4787.15 GOP/s at 211 MHz, with an output latency of 0.60 ms per image. This greatly shortens processing time and opens the door to AI applications in high-speed scenarios.
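The MobileNetV1 figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration, not part of the authors' method: it derives the work per frame, the clock cycles per frame, and the number of frames in flight in the pipeline from the four reported numbers. The 224 x 224 input resolution is an assumption (the standard MobileNetV1 input size); the abstract does not state it. The function and variable names are hypothetical.

```python
# Figures quoted in the abstract for the MobileNetV1 accelerator.
FPS = 4205.50              # frames per second
THROUGHPUT_GOPS = 4787.15  # giga-operations per second
CLOCK_HZ = 211e6           # 211 MHz
LATENCY_S = 0.60e-3        # output latency per image, in seconds

# ASSUMPTION: standard 224 x 224 MobileNetV1 input; not stated in the abstract.
ASSUMED_PIXELS_PER_FRAME = 224 * 224


def derived_metrics(fps, gops, clock_hz, latency_s, pixels_per_frame):
    """Derive per-frame quantities from the headline numbers."""
    frame_period_s = 1.0 / fps
    cycles_per_frame = clock_hz / fps
    return {
        # Work done per frame, in giga-operations.
        "gop_per_frame": gops / fps,
        # Clock cycles spent per frame in steady state.
        "cycles_per_frame": cycles_per_frame,
        # Cycles per input pixel under the assumed resolution.
        "cycles_per_pixel": cycles_per_frame / pixels_per_frame,
        # Time between successive output frames, in milliseconds.
        "frame_period_ms": frame_period_s * 1e3,
        # Latency divided by frame period: frames overlapping in the pipeline.
        "frames_in_flight": latency_s / frame_period_s,
    }


m = derived_metrics(FPS, THROUGHPUT_GOPS, CLOCK_HZ, LATENCY_S,
                    ASSUMED_PIXELS_PER_FRAME)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under the assumed resolution, the cycle count per frame comes out very close to the pixel count per frame, which is consistent with a pipeline that accepts roughly one input pixel per clock. The latency being about 2.5 frame periods suggests that several images overlap inside the pipeline at any moment.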

A Power-Efficient Accelerator for Convolutional Neural Networks

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A Power-Efficient and High Performance FPGA Accelerator for Convolutional Neural Networks: Work-in-progress

A High-Performance Accelerator for Large-Scale Convolutional Neural Networks

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

UniCNN: A Pipelined Accelerator Towards Uniformed Computing for CNNs

A High Efficient Architecture for Convolution Neural Network Accelerator

A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks

FPGA-based Accelerator for Convolutional Neural Network

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks

A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device

A 3D Tiled Low Power Accelerator for Convolutional Neural Network

A Low-Power Hardware Architecture for Real-Time CNN Computing

Efficient Hardware Architectures for Deep Convolutional Neural Network

A Parallel Loading Based Accelerator for Convolution Neural Network

A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs

An Asynchronous Energy-Efficient CNN Accelerator with Reconfigurable Architecture

CNN hardware acceleration on a low-power and low-cost APSoC