Abstract: Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and by weight pruning. However, exploiting both methods simultaneously complicates the design of pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited to small tiles, with the domain transformation dictating the sparsity ratio. Irregular data access and domain transformation pose further challenges in accelerator design, especially for larger Winograd tiles. This paper introduces "Winols," an algorithm-hardware co-design strategy that builds on the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both the spatial and Winograd domains. To compress pruned weight matrices, we devise a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate that our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms a dense accelerator by a factor of 31.7× in inference latency. Compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9× and improves DSP and energy efficiencies by over 5.6× and 5.7×, respectively. Compared with CPU and GPU platforms, the Winols accelerator with tile size 8 × 8 achieves 24.6× and 2.84× energy efficiency improvements, respectively.
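As background for the abstract's claim that the domain transformation dictates the sparsity ratio, the following is a minimal sketch of the standard 1-D Winograd F(2,3) minimal filtering algorithm. It is not the Winols method: the paper targets large 2-D tiles, and the function names here (`matvec`, `winograd_f23`, `direct_f23`) are hypothetical helpers chosen for illustration only.

```python
# Standard F(2,3) transform matrices: 2 outputs from a 3-tap filter
# over a 4-element input tile, using 4 multiplications instead of 6.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def winograd_f23(d, g):
    """Compute y = A^T [(G g) * (B^T d)] with an element-wise product."""
    u = matvec(G, g)    # filter in the Winograd domain (4 values)
    v = matvec(BT, d)   # input tile in the Winograd domain (4 values)
    m = [ui * vi for ui, vi in zip(u, v)]  # the 4 multiplications
    return matvec(AT, m)


def direct_f23(d, g):
    """Reference: direct sliding-window correlation, 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    d, g = [1, 2, 3, 4], [1, 2, 3]
    print(winograd_f23(d, g), direct_f23(d, g))
    # A spatially sparse filter is generally less sparse after transformation.
    print(matvec(G, [1, 0, 0]))
```

In this sketch the filter `[1, 0, 0]` has two zeros out of three weights in the spatial domain, but its transformed form `G g` has only one zero out of four, which illustrates why pruning in one domain does not automatically carry over to the other.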

Accelerating Large Kernel Convolutions with Nested Winograd Transformation

A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

DWM: A Decomposable Winograd Method for Convolution Acceleration

A tile-fusion method for accelerating Winograd convolutions

Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion

3D-NWA: A Nested-Winograd Accelerator for 3D CNNs

Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects

An Efficient Accelerator with Winograd for Novel Convolutional Neural Networks

Winograd Algorithm for 3D Convolution Neural Networks

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

InceptionNeXt: When Inception Meets ConvNeXt

Low-Rank Winograd Transformation for 3D Convolutional Neural Networks

Flexible and Efficient Convolutional Acceleration on Unified Hardware Using the Two-Stage Splitting Method and Layer-Adaptive Allocation of 1-D/2-D Winograd Units

Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs

A Novel GPU-Based Efficient Approach for Convolutional Neural Networks with Small Filters

Optimizing Winograd Convolution on ARMv8 processors

Enabling Sparse Winograd Convolution by Native Pruning

Layer-Wise Training To Create Efficient Convolutional Neural Networks

Dimension Fusion: Dimension-level Dynamically Composable Accelerator for Convolutional Neural Networks