Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Renzo Andri, Beatrice Bussolino, Antonio Cipolletta, Lukas Cavigelli, Zhe Wang

DOI: https://doi.org/10.48550/arXiv.2209.12982

2022-09-27

Abstract: Most of today's computer vision pipelines are built around deep neural networks, where convolution operations require most of the generally high compute effort. The Winograd convolution algorithm computes convolutions with fewer MACs compared to the standard algorithm, reducing the operation count by a factor of 2.25x for 3x3 convolutions when using the version with 2x2-sized tiles $F_2$. Even though the gain is significant, the Winograd algorithm with larger tile sizes, i.e., $F_4$, offers even more potential in improving throughput and energy efficiency, as it reduces the required MACs by 4x. Unfortunately, the Winograd algorithm with larger tile sizes introduces numerical issues that prevent its use on integer domain-specific accelerators and higher computational overhead to transform input and output data between spatial and Winograd domains. To unlock the full potential of Winograd $F_4$, we propose a novel tap-wise quantization method that overcomes the numerical issues of using larger tiles, enabling integer-only inference. Moreover, we present custom hardware units that process the Winograd transformations in a power- and area-efficient way, and we show how to integrate such custom modules in an industrial-grade, programmable DSA. An extensive experimental evaluation on a large set of state-of-the-art computer vision benchmarks reveals that the tap-wise quantization algorithm makes the quantized Winograd $F_4$ network almost as accurate as the FP32 baseline. The Winograd-enhanced DSA achieves up to 1.85x gain in energy efficiency and up to 1.83x end-to-end speed-up for state-of-the-art segmentation and detection networks.
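The 2.25x and 4x figures quoted in the abstract follow from counting multiplications per output tile. A minimal sketch, assuming the standard 2D Winograd formulation F(m x m, r x r), in which an m x m output tile needs (m + r - 1)^2 element-wise multiplications instead of m^2 * r^2; the function name is illustrative, and the count ignores the input, weight, and output transforms:

```python
def winograd_mac_reduction(m: int, r: int = 3) -> float:
    """Ratio of direct-convolution MACs to Winograd multiplications
    for one m x m output tile and an r x r kernel (transforms ignored)."""
    direct = (m * m) * (r * r)      # every output pixel needs r*r MACs
    winograd = (m + r - 1) ** 2     # one multiplication per Winograd-domain tap
    return direct / winograd


for m in (2, 4):
    print(f"F{m}: {winograd_mac_reduction(m):.2f}x")
# F2: 2.25x
# F4: 4.00x
```

The same count shows why $F_4$ works on 6x6 input tiles with 36 taps, which is the granularity at which tap-wise scaling factors are applied.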

Hardware Architecture, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses the numerical problems that arise when the Winograd convolution algorithm is used with a larger tile size (such as 4x4) on integer domain-specific accelerators (DSAs), and proposes a new tap-wise quantization method to overcome them. Specifically:

1. **Numerical stability issues**: A larger Winograd tile size (such as F4) significantly reduces the number of multiply-accumulate operations (MACs), but it introduces numerical instability that hinders its direct use on integer DSAs. The paper proposes a tap-wise quantization method that learns hardware-friendly power-of-2 scaling factors for each tap, which makes integer-only inference possible.
2. **Complex transformation operations**: The input, output, and weight transformations in the Winograd algorithm involve many small matrix multiplications and data layout rearrangements, which are difficult to handle efficiently on modern high-throughput matrix multiplication engines. The paper explores the design space of custom hardware modules that implement these low-arithmetic-intensity operations in an area- and power-efficient manner.
3. **Coordination of heterogeneous operations**: Adding the Winograd algorithm increases the heterogeneity of computational operations, which makes coordinating data movement and computation more complex. Moreover, although the Winograd algorithm reduces the computational complexity of convolutions, it also reduces the opportunities for data reuse and places higher demands on memory bandwidth. The paper shows how to integrate the Winograd transformation engines into an industrial-grade, programmable AI accelerator and how to tune the micro-architectures of these blocks so that the throughput of data movement, Winograd transformation, and compute operations match, maximizing overall computational efficiency.

With these methods, the quantized Winograd F4 network becomes almost as accurate as the FP32 baseline, and the Winograd-enhanced DSA achieves up to 1.85x higher energy efficiency and up to 1.83x end-to-end speed-up on state-of-the-art segmentation and detection networks. The gains come mainly from the compute-intensive convolutional layers, which matters for deploying deep-learning models both on edge devices and in data centers.
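The tap-wise idea in point 1 can be illustrated on a single tile. The following is a minimal sketch, not the paper's implementation: it uses the F(4x4, 3x3) transform matrices from Lavin and Gray, assumes symmetric int8 quantization in the Winograd domain, and calibrates the power-of-2 scales from the maximum magnitude per tap, whereas the paper learns them during training. All function names (`pow2_scales`, `quantize`, `winograd_f4`, `direct_conv`) are hypothetical.

```python
import numpy as np

# Winograd F(4x4, 3x3) transform matrices as given by Lavin & Gray (2016).
BT = np.array([[4, 0, -5, 0, 1, 0], [0, -4, -4, 1, 1, 0], [0, 4, -4, -1, 1, 0],
               [0, -2, -1, 2, 1, 0], [0, 2, -1, -2, 1, 0], [0, 4, 0, -5, 0, 1]], dtype=float)
G = np.array([[1/4, 0, 0], [-1/6, -1/6, -1/6], [-1/6, 1/6, -1/6],
              [1/24, 1/12, 1/6], [1/24, -1/12, 1/6], [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 1, 1, 0], [0, 1, -1, 2, -2, 0],
               [0, 1, 1, 4, 4, 0], [0, 1, -1, 8, -8, 1]], dtype=float)
QMAX = 127  # symmetric int8 range (assumption of this sketch)


def pow2_scales(x, tap_wise):
    """Power-of-two scales of shape (6, 6) for a (C, 6, 6) Winograd-domain tensor.

    tap_wise=True gives one scale per tap; False shares one scale across all taps.
    Scales are calibrated from the max magnitude here (the paper learns them).
    """
    max_abs = np.abs(x).max(axis=0) if tap_wise else np.full((6, 6), np.abs(x).max())
    return 2.0 ** np.ceil(np.log2(np.maximum(max_abs, 1e-12) / QMAX))


def quantize(x, scale):
    """Round to the integer grid defined by `scale` and clip to the int8 range."""
    return np.clip(np.round(x / scale), -QMAX, QMAX).astype(np.int32)


def direct_conv(d, g):
    """Reference: valid 3x3 cross-correlation summed over channels, 6x6 -> 4x4."""
    return np.array([[np.sum(d[:, i:i + 3, j:j + 3] * g) for j in range(4)]
                     for i in range(4)])


def winograd_f4(d, g, mode="tap"):
    """One F4 output tile. mode: 'fp' (no quantization), 'tensor', or 'tap'."""
    U = G @ g @ G.T      # weight transform, shape (C, 6, 6)
    V = BT @ d @ BT.T    # input transform, shape (C, 6, 6)
    if mode == "fp":
        M = (U * V).sum(axis=0)
    else:
        s_u = pow2_scales(U, mode == "tap")
        s_v = pow2_scales(V, mode == "tap")
        acc = (quantize(U, s_u) * quantize(V, s_v)).sum(axis=0)  # integer MACs
        M = acc * (s_u * s_v)  # per-tap rescale by a power of two (realizable as a shift)
    return AT @ M @ AT.T       # output transform, shape (4, 4)


rng = np.random.default_rng(0)
d = rng.standard_normal((16, 6, 6))  # 16 input channels, one 6x6 input tile
g = rng.standard_normal((16, 3, 3))  # matching 3x3 kernels for one output channel
ref = direct_conv(d, g)
for mode in ("fp", "tensor", "tap"):
    err = np.abs(winograd_f4(d, g, mode) - ref).max()
    print(f"{mode:>6}: max abs error = {err:.4f}")
```

Because the taps of the transformed input and weights differ widely in dynamic range, a single shared scale wastes resolution on the small-magnitude taps. A per-tap scale is never coarser than the shared one, and the product of two power-of-2 scales is again a power of 2, which is why the rescaling step stays cheap in integer hardware.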

PackQViT: Faster Sub-8-bit Vision Transformers Via Full and Packed Quantization on the Mobile

End-to-End Deployment of Winograd-Based DNNs on Edge GPU

A tile-fusion method for accelerating Winograd convolutions

BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Low Power Inference for On-Device Visual Recognition with a Quantization-Friendly Solution

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs

Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy

Training Transformers with 4-bit Integers

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Trainable Power-of-2 Scale Factors for Hardware-friendly Network Quantization

Accelerating Neural Network Inference by Overflow Aware Quantization

ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

FQ-Conv: Fully Quantized Convolution for Efficient and Accurate Inference

Training and inference for integer-based semantic segmentation network

A Quantization-Friendly Separable Convolution for MobileNets

Faster Inference of Integer SWIN Transformer by Removing the GELU Activation

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

HAWQV3: Dyadic Neural Network Quantization