Quantization without Tears

Minghao Fu, Hao Yu, Jie Shao, Junjie Zhou, Ke Zhu, Jianxin Wu
2024-11-21
Abstract: Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization paradigms.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets three key limitations of current network quantization methods:

1. **Trade-off between speed and accuracy**: Post-training quantization (PTQ), which requires no training, is fast, but its inference accuracy is usually low. Quantization-aware training (QAT), which does require training, achieves higher accuracy, but its quantization process is slow.
2. **Complexity**: Current quantization methods are often complex and sensitive, with a large number of hyperparameters to tune for each specific task. A single poorly set hyperparameter can seriously degrade model performance, which limits the generality of these methods across models and tasks.
3. **Lack of generality**: Because of this complexity, existing quantization methods are usually optimized only for specific models or tasks, and different models or tasks may require different quantization methods.

To address these problems, the paper proposes "Quantization without Tears" (QwT), a method that aims to achieve quantization speed, accuracy, simplicity, and generality simultaneously. The core idea of QwT is to add a lightweight structure to the quantized network to reduce the information lost during quantization. This structure consists of only a few linear layers, which keeps the method simple and efficient. More importantly, it admits a closed-form solution, so accuracy can be improved in under two minutes. Extensive experiments show that QwT is highly effective and general, applying to a range of vision, language, and multimodal tasks.
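
To make the core idea concrete, here is a minimal NumPy sketch of a closed-form linear compensation. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how the linear layers are fitted, so the sketch assumes a ridge-regularized least-squares fit to the gap between full-precision and quantized outputs. The toy single-layer "network", the symmetric 4-bit weight quantizer, the `ridge` value, and the names `quantize_uniform` and `fit_compensation` are all hypothetical.

```python
import numpy as np


def quantize_uniform(w, num_bits=4):
    """Symmetric per-tensor uniform quantization (an illustrative choice).

    Returns the dequantized weights, i.e. integer levels times the scale.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def fit_compensation(x, residual, ridge=1e-6):
    """Closed-form ridge least squares so that residual ~= x @ W + b.

    Assumption: the compensation layer is fitted to the output gap
    between the full-precision and the quantized computation.
    """
    # Append a column of ones so the bias is solved jointly with W.
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])
    sol = np.linalg.solve(gram, x_aug.T @ residual)
    return sol[:-1], sol[-1]


rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))   # calibration inputs (toy data)
w_fp = rng.normal(size=(16, 8))  # full-precision weights
w_q = quantize_uniform(w_fp, num_bits=4)

y_fp = np.tanh(x @ w_fp)         # full-precision output
y_q = np.tanh(x @ w_q)           # output with quantized weights

# Fit the linear compensation in one shot, with no gradient descent.
W, b = fit_compensation(x, y_fp - y_q)
y_comp = y_q + x @ W + b

err_before = np.mean((y_fp - y_q) ** 2)
err_after = np.mean((y_fp - y_comp) ** 2)
print(f"MSE before compensation: {err_before:.6f}")
print(f"MSE after compensation:  {err_after:.6f}")
```

Because the fit is a single linear solve over the calibration data, it needs no learning rate, schedule, or training loop, which is consistent with the speed and simplicity the summary describes. On the calibration set, the compensated error cannot exceed the uncompensated one, since choosing zero weights and bias is always a candidate solution.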