Abstract: The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment on resource-constrained platforms. Network pruning techniques have been proposed to remove redundancy in CNN parameters and produce sparse models, and sparsity-aware accelerators have been proposed to reduce the computation cost and memory bandwidth requirements of inference by leveraging model sparsity. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have addressed this issue by creating regular sparsity patterns through hardware-aware pruning algorithms, but the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore compression methods beyond pruning. We find that kernel decomposition, with its two decoupled computation stages, can potentially take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing sparse patterns. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map, and applies a hybrid quantization scheme that exploits the different reuse frequency of each part of the decomposed weights. At the architecture level, ESCALATE proposes a novel ‘Basis-First’ dataflow and its corresponding microarchitecture design to maximize the benefits of the decomposed convolution. We evaluate ESCALATE with four representative CNN models on both the CIFAR-10 and ImageNet datasets and compare it against previous sparse accelerators and pruning algorithms. Results show that ESCALATE achieves up to 325× and 11× compression ratios for models on CIFAR-10 and ImageNet, respectively. Compared with previous dense and sparse accelerators, the ESCALATE accelerator improves energy efficiency by 8.3× and 3.77× and reduces latency by 17.9× and 2.16× on average, respectively.
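To make the idea of a convolution with two decoupled computation stages concrete, the following is a minimal NumPy sketch. It is not the ESCALATE algorithm: the SVD-based factorization, the sharing of one basis set across all channels, and the names `decompose`, `decomposed_conv`, and `num_basis` are assumptions chosen for illustration, and the sketch omits the stage reorganization, hybrid quantization, and Basis-First dataflow described above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def direct_conv(x, w):
    """Reference 'valid' cross-correlation. x: (C, H, W), w: (K, C, R, S)."""
    patches = sliding_window_view(x, w.shape[2:], axis=(1, 2))  # (C, H', W', R, S)
    return np.einsum("chwrs,kcrs->khw", patches, w)


def decompose(w, num_basis):
    """Factor every RxS kernel into shared basis kernels plus coefficients.

    Assumption for illustration: a truncated SVD over all K*C kernels,
    with one basis set shared by every input and output channel.
    """
    k, c, r, s = w.shape
    flat = w.reshape(k * c, r * s)
    u, sv, vt = np.linalg.svd(flat, full_matrices=False)
    basis = vt[:num_basis].reshape(num_basis, r, s)              # (M, R, S)
    coef = (u[:, :num_basis] * sv[:num_basis]).reshape(k, c, num_basis)
    return basis, coef


def decomposed_conv(x, basis, coef):
    """Two-stage convolution using the factors returned by decompose()."""
    patches = sliding_window_view(x, basis.shape[1:], axis=(1, 2))
    # Stage 1: convolve each input channel with each basis kernel.
    inter = np.einsum("chwrs,mrs->cmhw", patches, basis)         # (C, M, H', W')
    # Stage 2: weighted accumulation of the intermediate maps per output channel.
    return np.einsum("cmhw,kcm->khw", inter, coef)               # (K, H', W')
```

With `num_basis` equal to R·S (and at least R·S kernels in total) the factorization is exact, so the two-stage result matches the direct convolution up to floating-point error. A smaller `num_basis` stores fewer weights at the cost of an approximation error. In this sketch the first stage touches only the dense basis kernels, so any sparsity in the coefficients would only matter in the second stage.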

A Stride-Based Convolution Decomposition Method to Stretch CNN Acceleration Algorithms for Efficient and Flexible Hardware Implementation

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Flexible and Efficient Convolutional Acceleration on Unified Hardware Using the Two-Stage Splitting Method and Layer-Adaptive Allocation of 1-D/2-D Winograd Units

A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs

A High-Throughput and Flexible CNN Accelerator Based on Mixed-Radix FFT Method

WRA-MF: A Bit-Level Convolutional-Weight-Decomposition Approach to Improve Parallel Computing Efficiency for Winograd-Based CNN Acceleration

An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

MF-Conv: A Novel Convolutional Approach Using Bit-Resolution-based Weight Decomposition to Eliminate Multiplications for CNN Acceleration

FWUA: A Flexible Winograd-Based Uniform Accelerator for 1D/2D/3D CNNs

Optimizing CNN Hardware Acceleration with Configurable Vector Units and Feature Layout Strategies

ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data–Index Decoupling Workflow

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

WRA-SS: A High-Performance Accelerator Integrating Winograd with Structured Sparsity for Convolutional Neural Networks

Improving the computational efficiency and flexibility of FPGA-based CNN accelerator through loop optimization

A Flexible and Energy-Efficient Convolutional Neural Network Acceleration with Dedicated ISA and Accelerator

A High Utilization FPGA-Based Accelerator for Variable-Scale Convolutional Neural Network

Relative Indexed Compressed Sparse Filter Encoding Format for Hardware-Oriented Acceleration of Deep Convolutional Neural Networks

An algorithm/hardware co‐optimized method to accelerate CNNs with compressed convolutional weights on FPGA