SearchQ: Search-based Fine-Grained Quantization for Data-Free Model Compression

Ning Yang, Fangxin Liu, Zongwu Wang, Junping Zhao, Li Jiang
DOI: https://doi.org/10.1109/tcasai.2024.3491941
2024-01-01
Abstract: The huge memory and computing costs of Deep Neural Networks (DNNs) greatly hinder their efficient deployment on resource-constrained devices. Quantization has emerged as an effective approach to shrinking the model size for memory saving and simplifying the operations for compute acceleration in DNNs. However, previous methods require retraining or fine-tuning after quantization to recover the model accuracy. Such methods do not apply to confidential scenarios because of personal privacy and security concerns, which highlights the need for data-free quantization methods. To this end, we propose a simple yet effective quantization framework, named SearchQ, which leverages a principled search approach to automatically allocate suitable quantization parameters (e.g., data format and bitwidth) for weights. Thus, the quantization parameters can be adapted to the various distributions of weights in different blocks. Based on such locality, we minimize the quantization error by better matching the weight distribution in every block before and after quantization. Since searching separately for each block leads to an exponentially vast search space, we use an effective strategy to speed up the search. Given quantization parameters specified by platform features, SearchQ derives an optimal quantized model for DNN deployment without any model retraining or expensive calculation. Comprehensive experimental results on various computer vision tasks validate that SearchQ outperforms other state-of-the-art network compression methods.
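The block-wise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two candidate formats (symmetric uniform and power-of-two), the mean-squared-error criterion, the flat fixed-size blocks, and all function names (`quantize_uniform`, `quantize_pow2`, `search_block`, `search_model`) are assumptions chosen for illustration. The sketch scores every platform-supplied candidate on each block independently and keeps the one with the lowest error; it does not reproduce the search speed-up strategy mentioned in the abstract.

```python
import math


def quantize_uniform(block, bits):
    """Symmetric uniform quantization with one scale per block (assumed format)."""
    levels = 2 ** (bits - 1) - 1  # positive integer levels, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0 or levels == 0:
        return [0.0] * len(block)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in block]


def quantize_pow2(block, bits):
    """Power-of-two quantization: a sign and a clipped integer exponent (assumed format)."""
    max_abs = max(abs(w) for w in block)
    n_exp = 2 ** (bits - 1) - 1  # exponent codes; one code is reserved for zero
    if max_abs == 0.0 or n_exp == 0:
        return [0.0] * len(block)
    top = math.floor(math.log2(max_abs))  # largest exponent in use
    low = top - n_exp + 1                 # smallest exponent in use
    out = []
    for w in block:
        if abs(w) < 2.0 ** (low - 1):     # too small for the lowest exponent
            out.append(0.0)
        else:
            e = min(max(round(math.log2(abs(w))), low), top)
            out.append(math.copysign(2.0 ** e, w))
    return out


# Hypothetical registry of data formats a target platform might support.
FORMATS = {"uniform": quantize_uniform, "pow2": quantize_pow2}


def mse(a, b):
    """Mean squared error between original and quantized weights."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_block(block, candidates):
    """Return (error, format_name, bits) with the lowest MSE for this block."""
    scored = [(mse(block, FORMATS[name](block, bits)), name, bits)
              for name, bits in candidates]
    return min(scored, key=lambda t: t[0])


def search_model(weights, block_size, candidates):
    """Split a flat weight list into blocks and search each block independently."""
    blocks = [weights[i:i + block_size]
              for i in range(0, len(weights), block_size)]
    return [search_block(b, candidates) for b in blocks]


# Example: two blocks of four weights, two 4-bit candidates given by the platform.
weights = [0.5, -0.25, 1.0, 0.125, 0.9, -0.7, 0.3, 0.1]
platform_candidates = [("uniform", 4), ("pow2", 4)]
for err, name, bits in search_model(weights, 4, platform_candidates):
    print(name, bits, err)
```

Because each block is scored on its own, the cost of this sketch grows linearly with the number of blocks and candidates, and no training data is needed since only the weights are inspected.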