Abstract: The systolic array is one of the most popular architectures for convolutional neural network (CNN) hardware accelerators. Its biggest advantage is a simple and efficient design principle: without complicated control or dataflow, a systolic-array accelerator can compute traditional convolution very efficiently. However, this simplicity also limits the array's flexibility. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array decreases sharply. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array, which allows users to split the array through configuration and thereby speed up small-scale convolution. Second, we redesign the PE so that the array supports multiple data transmission modes and dataflow strategies, which allows users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few modifications to the existing systolic array architecture. This avoids additional hardware overhead and allows CMSA to be deployed easily in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA increases the PE utilization rate by up to 1.6 times compared with a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolution in MobileNet. At the same time, CMSA is similar to traditional systolic arrays in area and energy consumption.
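To make the utilization argument concrete, the following is a minimal first-order sketch, not the evaluation model used in the article. It assumes a weight-stationary array in which a tile that streams `n_vectors` input vectors needs roughly `n_vectors + rows + cols - 2` cycles because of pipeline fill and drain, and it assumes that a depthwise filter occupies a single column of the array. The function names, the 32x32 and 16x16 array sizes, and the 7x7 feature map are all illustrative assumptions.

```python
def tile_utilization(n_vectors: int, rows: int, cols: int) -> float:
    """Temporal PE utilization of one fully mapped tile.

    Assumption: every PE holds a useful weight, and the only loss is the
    fill/drain skew of about (rows + cols - 2) cycles.
    """
    cycles = n_vectors + rows + cols - 2
    return n_vectors / cycles


def depthwise_spatial_utilization(kh: int, kw: int, rows: int, cols: int) -> float:
    """Fraction of PEs holding useful weights for one depthwise filter.

    Assumption: a depthwise filter reduces over kh*kw values of a single
    input channel, so it fills at most one column of the array.
    """
    active_pes = min(kh * kw, rows)
    return active_pes / (rows * cols)


if __name__ == "__main__":
    # Hypothetical small-scale layer: a 7x7 output map gives 49 input vectors.
    whole = tile_utilization(49, 32, 32)   # one 32x32 array
    split = tile_utilization(49, 16, 16)   # each of four 16x16 sub-arrays
    print(f"32x32 array:       {whole:.3f}")
    print(f"16x16 sub-arrays:  {split:.3f}")

    # Hypothetical 3x3 depthwise filter on a 32x32 array.
    print(f"depthwise spatial: {depthwise_spatial_utilization(3, 3, 32, 32):.4f}")
```

Under these assumptions, shorter sub-arrays spend fewer cycles filling and draining for the same small feature map, which is the intuition behind splitting the array. The depthwise figure likewise suggests why a fixed dataflow leaves most PEs idle. The numbers are illustrative only and should not be compared with the 1.6 times and 14.8 times results reported above.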

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks
