Abstract: Three convolutional acceleration algorithms are widely used to reduce the number of multiplications in the convolutions of convolutional neural networks (CNNs): Winograd, FFT, and FFA. However, current accelerators based on these algorithms have problems with flexibility and efficiency. First, some accelerators combine several of these algorithms and employ multiple types of computational units to exploit their respective advantages. As a result, the other computational units sit idle while the best-performing unit is working, which wastes considerable area. Second, current accelerators tend to choose small parameters for these algorithms to avoid unacceptable precision loss; as a result, they can hardly support large kernel sizes and lack flexibility. Third, these algorithms are typically formulated for stride-1 convolutions, so few implementations consider the acceleration of large-stride convolutions, which is a major restriction on hardware flexibility. This paper proposes a stride-based convolution decomposition method (SCDM) that converts different convolution shapes (i.e., kernel sizes and strides) into an identical pattern. With the aid of SCDM, a Winograd-stretched and hardware-efficient design (WHD) is presented that uses one uniform computational unit to accelerate different convolution shapes and combines the complementary performance advantages of Winograd F(4,3) and F(4,2) units. Compared with current FFT-based or FFA-based works, WHD extends the range of convolution shapes that Winograd can cover and simplifies implementation, thereby achieving both hardware flexibility and efficiency. Evaluation results show that an operation reduction of 34.08% to 55.41% is achieved on six CNN models, while incurring only a slight hardware overhead.
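The abstract does not spell out how SCDM works, so the following is only a minimal sketch of the general idea of decomposing a strided convolution by stride phase, written in 1-D with plain Python. It assumes the standard polyphase identity: a stride-s convolution equals the sum of s stride-1 convolutions over subsampled inputs and sub-kernels. The function names `conv1d` and `strided_conv_by_decomposition` are hypothetical and are not taken from the paper.

```python
def conv1d(x, w, stride=1):
    """Direct 'valid' 1-D convolution in the CNN sense (no kernel flip)."""
    n_out = (len(x) - len(w)) // stride + 1
    return [sum(w[k] * x[i * stride + k] for k in range(len(w)))
            for i in range(n_out)]


def strided_conv_by_decomposition(x, w, stride):
    """Same result as conv1d(x, w, stride), using only stride-1 convolutions.

    Illustrative assumption, not the paper's exact SCDM: the kernel index is
    split as k = stride * j + p, so phase p pairs the sub-kernel w[p::stride]
    with the subsampled input x[p::stride].
    """
    n_out = (len(x) - len(w)) // stride + 1
    y = [0] * n_out
    for p in range(stride):
        w_p = w[p::stride]          # taps p, p + stride, p + 2*stride, ...
        if not w_p:                 # kernel shorter than the stride
            continue
        x_p = x[p::stride]          # input samples seen by those taps
        part = conv1d(x_p, w_p, stride=1)
        for i in range(n_out):      # part may be longer than n_out; extras are unused
            y[i] += part[i]
    return y


x = list(range(1, 13))              # 12 input samples
w = [1, 2, 3, 4, 5]                 # 5-tap kernel
direct = conv1d(x, w, stride=2)
decomposed = strided_conv_by_decomposition(x, w, stride=2)
print(direct == decomposed)
```

In this example the 5-tap, stride-2 kernel splits into sub-kernels of 3 and 2 taps, each applied at stride 1. Those sizes happen to match the F(4,3) and F(4,2) units named in the abstract, but whether the paper maps them this way is an assumption here.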

A Stride-Based Convolution Decomposition Method to Stretch CNN Acceleration Algorithms for Efficient and Flexible Hardware Implementation
