Abstract: Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing both methods simultaneously complicates the design of pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited to small tiles, with domain transformation dictating the sparsity ratio. The irregularities in data access and domain transformation pose challenges in accelerator design, especially for larger Winograd tiles. This paper introduces "Winols," an algorithm-hardware co-design strategy that emphasizes the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both the spatial and Winograd domains. To compress pruned weight matrices, we devise a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate that our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms a dense accelerator by 31.7× in inference latency. Compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9× and improves DSP and energy efficiencies by over 5.6× and 5.7×, respectively. Compared with CPU and GPU platforms, the Winols accelerator with tile size 8 × 8 achieves 24.6× and 2.84× energy efficiency improvements, respectively.
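
The abstract's point that domain transformation dictates the sparsity ratio can be illustrated with the smallest textbook case, the 1-D Winograd F(2,3) algorithm. The sketch below is not the Winols method or its 8 × 8 tiling; the function names are hypothetical and chosen for illustration only. It checks that the Winograd result matches direct convolution and shows that a filter with a single nonzero tap in the spatial domain can have three nonzeros after transformation, which is why pruning in one domain does not automatically give sparsity in the other.

```python
from fractions import Fraction

# Illustrative sketch of 1-D Winograd F(2,3): 2 outputs, 3-tap filter, 4 inputs.
# All function names here are hypothetical, not taken from the paper.

def transform_filter(g):
    """Filter transform U = G g (spatial domain -> Winograd domain)."""
    g0, g1, g2 = (Fraction(x) for x in g)
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]

def transform_input(d):
    """Input transform V = B^T d for a 4-element input tile."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

def winograd_f23(d, g):
    """Two outputs from 4 element-wise multiplications instead of 6."""
    u = transform_filter(g)
    v = transform_input(d)
    m = [ui * vi for ui, vi in zip(u, v)]
    # Output transform Y = A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct_conv(d, g):
    """Reference: direct sliding-window correlation, 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

def count_nonzero(xs):
    return sum(1 for x in xs if x != 0)

if __name__ == "__main__":
    d = [1, 2, 3, 4]
    g = [1, 0, 0]  # pruned in the spatial domain: one nonzero tap
    print(winograd_f23(d, g) == direct_conv(d, g))
    print(count_nonzero(g), count_nonzero(transform_filter(g)))
```

Exact rational arithmetic (`Fraction`) is used so the equality check is not affected by floating-point rounding; a real accelerator would use fixed-point or floating-point arithmetic.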

A tile-fusion method for accelerating Winograd convolutions

Accelerating Large Kernel Convolutions with Nested Winograd Transformation

Flexible and Efficient Convolutional Acceleration on Unified Hardware Using the Two-Stage Splitting Method and Layer-Adaptive Allocation of 1-D/2-D Winograd Units

Dimension Fusion: Dimension-level Dynamically Composable Accelerator for Convolutional Neural Networks

Enabling Efficient Fast Convolution Algorithms on GPUs Via MegaKernels

Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs

Enabling Sparse Winograd Convolution by Native Pruning

A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm

OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA

Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

A Fast Algorithm for Convolutional Neural Networks Using Tile-based Fast Fourier Transforms

FTConv: FPGA Acceleration for Transposed Convolution Layers in Deep Neural Networks

Low-Rank Winograd Transformation for 3D Convolutional Neural Networks

Optimizing Half Precision Winograd Convolution on ARM Many-Core Processors

Efficient Convolutional Neural Networks Utilizing Fine-Grained Fast Fourier Transforms

BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Zero and data reuse-aware fast convolution for deep neural networks on GPU

Accelerating convolutional neural network by exploiting sparsity on GPUs

WRA-SS: A High-Performance Accelerator Integrating Winograd with Structured Sparsity for Convolutional Neural Networks

Instruction driven cross-layer CNN accelerator with Winograd transformation on FPGA