Accelerating Large Kernel Convolutions with Nested Winograd Transformation

Jingbo Jiang, Xizi Chen, Chi-Ying Tsui
DOI: https://doi.org/10.1109/VLSI-SoC57769.2023.10321932
2023-12-31
Abstract: Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd transformation algorithm. Experiments show that compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4x4 to 31x31 convolutions.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the efficient computation of large-kernel convolutions in Convolutional Neural Networks (CNNs). Specifically, it proposes a new "Nested Winograd Transformation" algorithm to overcome the limitations of existing Winograd-based methods when they are applied to large convolution kernels.

### Overview of the Problem Addressed by the Paper

1. **Background and Motivation**:
   - Recent studies have shown that CNNs using large convolution kernels outperform Vision Transformers (ViTs) and CNNs with stacked small convolution kernels in many computer vision tasks, such as object detection and image restoration.
   - The Winograd transformation is widely used to reduce the number of redundant multiplications in convolution, thereby improving computational efficiency.
   - Existing methods apply the Winograd transformation by linearly decomposing a large convolution kernel into multiple small convolution kernels, but this approach does not fully exploit the transformation's ability to remove redundant computation.
2. **Proposed Method**:
   - A Nested Winograd Transformation algorithm is proposed that iteratively decomposes a large convolution kernel into a series of small convolution kernels. The paper shows that this is more efficient than the linear decomposition Winograd algorithm.
   - Experimental results show that, compared to the linear decomposition Winograd algorithm, the Nested Winograd algorithm reduces the number of multiplications by a factor of 1.4 to 10.5 when computing convolutions of sizes ranging from 4×4 to 31×31.
3. **Summary of Contributions**:
   - A Nested Winograd Transformation algorithm is proposed to accelerate large-kernel convolutions, and its advantage over existing techniques is demonstrated.
   - An accelerator architecture and runtime system are designed that use the Nested Winograd Transformation to accelerate convolutions with kernels of any size, and their effectiveness is validated on an FPGA.

### Methodology Details

- **Background**: Introduces the basic principles of the Winograd transformation and how it is applied to convolution, including the definition of the transformation matrices and the computation process.
- **Linear Decomposition Winograd Algorithm**: Describes how a large convolution kernel is linearly decomposed into multiple small convolution kernels, each of which is then accelerated with the Winograd transformation.
- **Nested Winograd Algorithm**: Explains how the Nested Winograd Transformation works, including the input transformation, kernel transformation, and output transformation steps, and provides an analysis of the algorithm.
- **Accelerator Design**: Describes the accelerator design, including the hardware architecture, instruction decoding, computation pipeline, and other modules.
- **Experimental Results**: Compares the multiplication complexity of the Nested Winograd algorithm with that of the linear decomposition Winograd algorithm and native convolution through simulation, and demonstrates performance improvements in practical applications.

In summary, this paper targets the inefficiency of large-kernel convolution by proposing a new Nested Winograd Transformation algorithm, and it validates the algorithm's effectiveness through theoretical analysis and experiments.
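
To make the baseline concrete, the following is a minimal 1D sketch of the linear decomposition idea described above, built on the standard Winograd F(2,3) transform (two outputs of a 3-tap filter computed with 4 multiplications instead of 6). It is an illustration under stated assumptions, not the paper's implementation: the function names `direct_corr`, `winograd_f23`, and `linear_decomposition` are hypothetical, the example is 1D rather than 2D, and the nested algorithm itself is not reproduced here because this summary does not give its details.

```python
# Illustrative sketch only: all names are hypothetical, not taken from the paper.

def direct_corr(d, g):
    """CNN-style sliding-window convolution: y[i] = sum_k d[i + k] * g[k]."""
    n_out = len(d) - len(g) + 1
    return [sum(d[i + k] * g[k] for k in range(len(g))) for i in range(n_out)]


def winograd_f23(d, g):
    """Standard Winograd F(2,3): 2 outputs of a 3-tap filter from a 4-sample tile."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Kernel transform (can be precomputed once per kernel).
    u0, u1, u2, u3 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2
    # Input transform (additions and subtractions only).
    v0, v1, v2, v3 = d0 - d2, d1 + d2, d2 - d1, d1 - d3
    # Element-wise products: the only 4 general multiplications.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform (additions and subtractions only).
    return [m0 + m1 + m2, m1 - m2 - m3]


def linear_decomposition(d, g):
    """Split a long kernel into 3-tap pieces, apply F(2,3) to each, and sum.

    Returns the outputs and the number of element-wise multiplications used.
    Assumes len(g) is a multiple of 3 and the output count is even.
    """
    assert len(g) % 3 == 0, "sketch assumes kernel length is a multiple of 3"
    n_out = len(d) - len(g) + 1
    assert n_out > 0 and n_out % 2 == 0, "sketch assumes an even number of outputs"
    y = [0.0] * n_out
    mults = 0
    for p in range(0, len(g), 3):        # each 3-tap sub-kernel
        for i in range(0, n_out, 2):     # each 2-output tile
            out = winograd_f23(d[i + p : i + p + 4], g[p : p + 3])
            y[i] += out[0]
            y[i + 1] += out[1]
            mults += 4
    return y, mults


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    g = [1.0, 0.0, -1.0, 2.0, 1.0, 0.5]   # 6-tap kernel -> two 3-tap pieces
    y, mults = linear_decomposition(d, g)
    print(y, direct_corr(d, g))
    print("multiplications:", mults, "vs direct:", (len(d) - len(g) + 1) * len(g))
```

In this toy case the decomposed version uses 16 element-wise multiplications where the direct sliding window uses 24. Each sub-kernel is still transformed and multiplied independently, which is the cost the nested approach is reported to reduce further.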