Abstract: Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization paradigms.
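To make "discretizing network weights into a finite set of integer values" concrete, here is a minimal sketch of symmetric, per-tensor uniform quantization in NumPy. It is a generic textbook scheme chosen for illustration, not the quantizer used in the paper; the function names `quantize_uniform` and `dequantize` and the random weight matrix are assumptions.

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Symmetric per-tensor uniform quantization (illustrative only)."""
    if not 2 <= num_bits <= 8:
        raise ValueError("this sketch stores values in int8, so use 2 to 8 bits")
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each float to the nearest integer level in [-qmax, qmax].
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

q_weights, scale = quantize_uniform(weights, num_bits=8)
recovered = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - recovered)))

print("storage (bytes):", weights.nbytes, "->", q_weights.nbytes)
print("max rounding error:", max_error, "scale:", scale)
```

The int8 copy takes a quarter of the float32 storage, and the rounding error per weight is bounded by about half the step size `scale`. That rounding error is the information loss the abstract refers to.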

What problem does this paper attempt to address?

This paper targets three challenges in current network quantization methods:

1. **Trade-off between speed and accuracy**: Post-training quantization (PTQ) requires no training and is fast, but the accuracy of the quantized model is usually low. Quantization-aware training (QAT) achieves higher accuracy, but it requires training, which makes the quantization process slow.
2. **Complexity**: Current quantization methods are often complex and sensitive, and many hyperparameters must be tuned for each specific task. A single poorly chosen hyperparameter can seriously degrade model performance, which limits how well these methods transfer across models and tasks.
3. **Lack of generality**: Because of this complexity, existing quantization methods are usually optimized only for specific models or tasks, so different models or tasks may require different quantization methods.

To address these problems, the paper proposes "Quantization without Tears" (QwT), which aims to achieve quantization speed, accuracy, simplicity, and generality at the same time. The core idea of QwT is to add a lightweight structure to the quantized network that reduces the information lost during quantization. This structure consists of only a few linear layers, which keeps the method simple and efficient. More importantly, it has a closed-form solution, so accuracy can be improved in under two minutes. Extensive experiments show that QwT is both highly effective and general, applying to a range of vision, language, and multimodal tasks.
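The idea of a small linear structure with a closed-form solution can be sketched as follows. This is not the paper's implementation: the text above only states that the added structure consists of a few linear layers and is solved in closed form. The sketch assumes the linear layer runs in parallel with a quantized block and is fitted by ridge-regularized least squares to predict the gap between full-precision and quantized outputs on calibration inputs. The names `fake_quantize` and `fit_compensation`, the toy `tanh` block, and the 4-bit setting are all hypothetical.

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Round weights to a uniform grid and return them as floats (illustrative)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(np.max(np.abs(w))) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def fit_compensation(x, y_full, y_quant, ridge=1e-6):
    """Closed-form fit of A, b minimizing ||y_full - (y_quant + x @ A + b)||^2.

    Assumption: a single linear layer on the block input, solved with the
    ridge-regularized normal equations (the small penalty also covers the bias).
    """
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])   # extra column for the bias
    residual = y_full - y_quant                        # what quantization lost
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])
    solution = np.linalg.solve(gram, x_aug.T @ residual)
    return solution[:-1], solution[-1]                 # A, b

# Toy "block": one linear layer followed by tanh, with hypothetical calibration data.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 32))
w = rng.normal(size=(32, 16))

y_full = np.tanh(x @ w)                    # full-precision output
y_quant = np.tanh(x @ fake_quantize(w))    # output with 4-bit weights

a, b = fit_compensation(x, y_full, y_quant)
y_comp = y_quant + x @ a + b               # quantized block plus linear correction

err_before = float(np.mean((y_full - y_quant) ** 2))
err_after = float(np.mean((y_full - y_comp) ** 2))
print("MSE without compensation:", err_before)
print("MSE with compensation:   ", err_after)
```

Because the fit is a single linear solve with no gradient descent and no learning rate, it needs almost no tuning and runs quickly, which matches the speed and simplicity goals described above. On the calibration data the compensated error cannot exceed the uncompensated error, since setting `A` and `b` to zero is always a candidate solution.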

Quantization without Tears

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Quantization Networks

VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization

HAWQV3: Dyadic Neural Network Quantization

Deep Neural Network Compression With Single and Multiple Level Quantization

Instance-Aware Dynamic Neural Network Quantization

Iterative Deep Neural Network Quantization with Lipschitz Constraint

Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss

Bit-Quantized-Net: an Effective Method for Compressing Deep Neural Networks.

SQuant: On-the-Fly Data-Free Quantization Via Diagonal Hessian Approximation

Adaptive Layerwise Quantization for Deep Neural Network Compression

A White Paper on Neural Network Quantization

μL2Q: An Ultra-Low Loss Quantization Method for DNN Compression

Two-Step Quantization for Low-bit Neural Networks

Adaptive Quantization for Deep Neural Network

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data.

Quantization of Deep Neural Networks for Accurate Edge Computing

CSMPQ: Class Separability Based Mixed-Precision Quantization.