QBitOpt: Fast and Accurate Bitwidth Reallocation during Training

Jorn Peters, Marios Fournarakis, Markus Nagel, Mart van Baalen, Tijmen Blankevoort
2023-07-10
Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocation is a challenging problem as the search space grows exponentially with the number of layers in the network. In this paper, we propose QBitOpt, a novel algorithm for updating bitwidths during quantization-aware training (QAT). We formulate the bitwidth allocation problem as a constrained optimization problem. By combining fast-to-compute sensitivities with efficient solvers during QAT, QBitOpt can produce mixed-precision networks with high task performance guaranteed to satisfy strict resource constraints. This contrasts with existing mixed-precision methods that learn bitwidths using gradients and cannot provide such guarantees. We evaluate QBitOpt on ImageNet and confirm that we outperform existing fixed and mixed-precision methods under average bitwidth constraints commonly found in the literature.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to allocate a bitwidth to each layer of a neural network during quantization, efficiently and accurately, so that task performance is maximized while strict resource constraints are met. Specifically, it proposes QBitOpt, a new algorithm for updating bitwidths during quantization-aware training (QAT). By combining fast-to-compute sensitivities with an efficient solver, QBitOpt produces mixed-precision networks that are guaranteed to satisfy strict resource constraints while retaining high task performance. This differs from existing mixed-precision methods, which learn bitwidths using gradients and cannot provide such a guarantee.

The main contributions of the paper are:
- It outputs a quantized neural network that is guaranteed to meet the resource constraints. Most existing methods rely on hyperparameter search to balance accuracy against resource constraints and cannot guarantee that the constraints are met.
- It formulates bitwidth allocation as a constrained convex optimization problem, so the method scales to networks with many quantizers and can be solved quickly with off-the-shelf software.
- It is the first to integrate optimization-based bitwidth allocation with existing quantization-aware training methods, and it outperforms competing mixed-precision methods on ImageNet classification under average bitwidth constraints.
- It demonstrates that updating the bitwidth allocation during training is crucial for optimal performance and is superior to the common approach of post-training bitwidth allocation followed by quantization-aware fine-tuning.

These contributions address the key challenge in current mixed-precision quantization (MPQ) methods: finding the optimal bitwidth for each layer while maintaining high task performance under a resource constraint.
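The idea of choosing bitwidths from per-layer sensitivities under a hard budget can be illustrated with a small sketch. This is not the QBitOpt solver: it swaps in a simple greedy heuristic, and the function name `allocate_bitwidths`, the error proxy (sensitivity times 4^(-bits)), the candidate bitwidths, and the example numbers are all assumptions made for illustration. What it does show is the property the summary emphasizes: the returned allocation never exceeds the average bitwidth budget.

```python
from typing import List, Sequence


def allocate_bitwidths(
    sensitivities: Sequence[float],
    sizes: Sequence[int],
    avg_bits: float,
    choices: Sequence[int] = (2, 4, 8),
) -> List[int]:
    """Greedy allocation under a size-weighted average bitwidth budget.

    Illustrative heuristic only; the paper uses a convex solver instead.
    """
    choices = sorted(choices)
    if avg_bits < choices[0]:
        raise ValueError("budget is below the smallest allowed bitwidth")

    budget = avg_bits * sum(sizes)      # total number of bits allowed
    level = [0] * len(sizes)            # index into `choices` for each layer
    used = choices[0] * sum(sizes)      # every layer starts at the lowest bitwidth

    def error(i: int, bits: int) -> float:
        # Assumed proxy: quantization noise shrinks about 4x per extra bit.
        return sensitivities[i] * 4.0 ** (-bits)

    while True:
        best, best_gain = None, 0.0
        for i, lv in enumerate(level):
            if lv + 1 >= len(choices):
                continue                # layer is already at the highest bitwidth
            cur, nxt = choices[lv], choices[lv + 1]
            cost = (nxt - cur) * sizes[i]
            if used + cost > budget:
                continue                # this upgrade would violate the constraint
            gain = (error(i, cur) - error(i, nxt)) / cost
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                       # no feasible upgrade is left
        used += (choices[level[best] + 1] - choices[level[best]]) * sizes[best]
        level[best] += 1

    return [choices[lv] for lv in level]


if __name__ == "__main__":
    # Hypothetical sensitivities and layer sizes, 4-bit average budget.
    print(allocate_bitwidths([100.0, 1.0, 0.01], [100, 100, 100], avg_bits=4))
```

With these made-up inputs the most sensitive layer receives 8 bits and the other two stay at 2 bits, which meets the 4-bit average exactly. Because an upgrade is only accepted when it fits the remaining budget, the constraint holds by construction rather than through a tuned penalty term, which is the contrast with gradient-based bitwidth learning that the summary draws.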