Abstract: Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd transformation algorithm. Experiments show that compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4x4 to 31x31 convolutions.
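To make the abstract's claim about saved multiplications concrete, here is a minimal sketch of the classic one-dimensional Winograd F(2,3) case: two outputs of a 3-tap kernel computed with 4 data multiplications instead of 6. This is the standard small-kernel building block, not the nested algorithm proposed in the paper, and the function names `direct_f23` and `winograd_f23` are illustrative assumptions.

```python
def direct_f23(d, g):
    """Direct 1D correlation: 2 outputs from 4 inputs and a 3-tap kernel.

    Uses 2 * 3 = 6 multiplications.
    """
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 data multiplications."""
    # Kernel transform. It depends only on the kernel, so it can be
    # computed once and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform (additions only), then 4 element-wise products.
    m0 = (d[0] - d[2]) * u0
    m1 = (d[1] + d[2]) * u1
    m2 = (d[2] - d[1]) * u2
    m3 = (d[1] - d[3]) * u3

    # Output transform (additions only).
    return [m0 + m1 + m2, m1 - m2 - m3]


if __name__ == "__main__":
    d = [1, 2, 3, 4]
    g = [2, 4, 6]
    print(direct_f23(d, g))
    print(winograd_f23(d, g))
```

Both functions should return the same values for any input tile; the saving comes from moving work out of multiplications and into additions plus a kernel transform that is amortized across tiles.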

What problem does this paper attempt to address?

This paper is primarily dedicated to addressing the efficient computation of large convolution kernels in Convolutional Neural Networks (CNNs). Specifically, the paper proposes a new "Nested Winograd Transformation" algorithm to overcome the limitations of existing Winograd transformation methods when dealing with large convolution kernels.

### Overview of the Problem Addressed by the Paper

1. **Background and Motivation**:
   - Recent studies have shown that CNNs using large convolution kernels outperform Vision Transformers (ViTs) and CNNs with stacked small convolution kernels in many computer vision tasks, such as object detection and image restoration.
   - The Winograd transformation is widely used to reduce the number of redundant multiplication operations in convolution computations, thereby improving computational efficiency.
   - Existing methods utilize the Winograd transformation by linearly decomposing large convolution kernels into multiple small convolution kernels, but this approach does not fully exploit the advantage of the Winograd transformation in reducing redundant computations.
2. **Proposed Method**:
   - A Nested Winograd Transformation algorithm is proposed, which iteratively decomposes large convolution kernels into a series of small convolution kernels and demonstrates that this method is more efficient than the linear decomposition Winograd transformation algorithm.
   - Experimental results show that compared to the linear decomposition Winograd algorithm, the Nested Winograd algorithm can reduce the number of multiplication operations by 1.4 to 10.5 times when computing convolutions of sizes ranging from 4×4 to 31×31.
3. **Summary of Contributions**:
   - A Nested Winograd Transformation algorithm is proposed to accelerate the execution of large convolution kernels and demonstrate its performance superiority over existing techniques.
   - An accelerator architecture and runtime system are designed to utilize the Nested Winograd Transformation to accelerate the computation of convolution kernels of any size, and its effectiveness is validated through FPGA.

### Methodology Details

- **Background**: Introduces the basic principles of the Winograd transformation and how it is applied in convolution computations, including the definition of transformation matrices and the computation process.
- **Linear Decomposition Winograd Algorithm**: Describes how to linearly decompose large convolution kernels into multiple small convolution kernels and then use the Winograd transformation for acceleration.
- **Nested Winograd Algorithm**: Elaborates on the working principle of the Nested Winograd Transformation, including steps such as input transformation, kernel transformation, and output transformation, and provides algorithm analysis.
- **Accelerator Design**: Describes the design scheme of the accelerator, including hardware architecture, instruction decoding, computation pipeline, and other modules.
- **Experimental Results**: Compares the multiplication complexity of the Nested Winograd algorithm with the linear decomposition Winograd algorithm and native convolution through simulation experiments and demonstrates performance improvements in practical applications.

In summary, this paper aims to address the inefficiency of large convolution kernel computations by proposing a new Nested Winograd Transformation algorithm and validates its effectiveness and superiority through theoretical analysis and experimental verification.
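The linear decomposition baseline described above can be sketched in one dimension as follows. This is only an illustration of the general idea (split a long kernel into 3-tap pieces, apply each piece to a shifted input, sum the partial outputs); it is not the paper's nested algorithm, and the names `correlate`, `linear_decompose`, and `mults_per_two_outputs` are hypothetical. The multiplication count assumes each 3-tap piece is evaluated with a 1D Winograd F(2,3) tile costing 4 multiplications per 2 outputs.

```python
def correlate(x, g):
    """Direct 1D 'valid' correlation of signal x with kernel g."""
    n_out = len(x) - len(g) + 1
    return [sum(x[i + k] * g[k] for k in range(len(g))) for i in range(n_out)]


def linear_decompose(x, g, r=3):
    """Compute correlate(x, g) as a sum of r-tap correlations.

    The kernel is zero-padded to a multiple of r and cut into pieces.
    Piece c covers taps [c*r, c*r + r) and is applied to the input
    shifted by c*r samples.
    """
    n_out = len(x) - len(g) + 1
    pad = (-len(g)) % r
    g_pad = list(g) + [0] * pad
    x_pad = list(x) + [0] * pad  # padded taps are zero, so padded inputs never contribute

    y = [0] * n_out
    for start in range(0, len(g_pad), r):
        piece = g_pad[start:start + r]
        window = x_pad[start:start + n_out + r - 1]
        partial = correlate(window, piece)  # a small-kernel correlation
        y = [a + b for a, b in zip(y, partial)]
    return y


def mults_per_two_outputs(k):
    """Multiplications for 2 outputs of a k-tap 1D kernel.

    Returns (direct, decomposed), assuming each 3-tap piece uses an
    F(2,3) tile with 4 multiplications.
    """
    pieces = -(-k // 3)  # ceil(k / 3)
    return 2 * k, 4 * pieces


if __name__ == "__main__":
    x = list(range(1, 11))
    g = [1, 2, 3, 4, 5]
    print(linear_decompose(x, g) == correlate(x, g))
    print(mults_per_two_outputs(9))
```

Under these assumptions the number of multiplications still grows linearly with the number of pieces, which is the cost the summary says the nested decomposition aims to reduce.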


Accelerating Large Kernel Convolutions with Nested Winograd Transformation

A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

DWM: A Decomposable Winograd Method for Convolution Acceleration

A tile-fusion method for accelerating Winograd convolutions

Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion

Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects

An Efficient Accelerator with Winograd for Novel Convolutional Neural Networks

3D-NWA: A Nested-Winograd Accelerator for 3D CNNs

Winograd Algorithm For 3d Convolution Neural Networks

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

Low-Rank Winograd Transformation for 3D Convolutional Neural Networks

A Novel GPU-Based Efficient Approach for Convolutional Neural Networks with Small Filters

InceptionNeXt: When Inception Meets ConvNeXt

Flexible and Efficient Convolutional Acceleration on Unified Hardware Using the Two-Stage Splitting Method and Layer-Adaptive Allocation of 1-D/2-D Winograd Units

Enabling Sparse Winograd Convolution by Native Pruning

Layer-Wise Training To Create Efficient Convolutional Neural Networks

Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs

Optimizing Winograd Convolution on ARMv8 processors

Enabling Efficient Fast Convolution Algorithms on GPUs Via MegaKernels