Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocation is a challenging problem as the search space grows exponentially with the number of layers in the network. In this paper, we propose QBitOpt, a novel algorithm for updating bitwidths during quantization-aware training (QAT). We formulate the bitwidth allocation problem as a constrained optimization problem. By combining fast-to-compute sensitivities with efficient solvers during QAT, QBitOpt can produce mixed-precision networks with high task performance guaranteed to satisfy strict resource constraints. This contrasts with existing mixed-precision methods that learn bitwidths using gradients and cannot provide such guarantees. We evaluate QBitOpt on ImageNet and confirm that we outperform existing fixed and mixed-precision methods under average bitwidth constraints commonly found in the literature.

What problem does this paper attempt to address?

This paper addresses how to allocate a bit-width to each layer of a neural network efficiently and accurately during quantization, so that task performance is maximized while strict resource constraints are met. Specifically, it proposes QBitOpt, a new algorithm for updating bit-widths during quantization-aware training (QAT). By combining fast-to-compute sensitivities with an efficient solver, QBitOpt produces mixed-precision networks that are guaranteed to satisfy strict resource constraints while achieving high task performance. This differs from existing mixed-precision methods, which learn bit-widths using gradients and cannot provide such a guarantee.

The main contributions of the paper include:

- Outputting a quantized neural network that is guaranteed to meet resource constraints. Most existing methods rely on hyper-parameter search to balance accuracy against resource constraints and cannot guarantee that the constraints are met.
- Formulating the bit-width allocation problem as a constrained convex optimization problem, so that the method scales to networks with many quantizers and can be solved quickly and effectively with off-the-shelf software.
- Integrating, for the first time, optimization-based bit-width allocation with existing quantization-aware training methods, and outperforming competing mixed-precision methods on ImageNet classification under an average bit-width constraint.
- Demonstrating that updating the bit-width allocation during training is crucial for optimal performance and is superior to the common approach of post-training bit-width allocation followed by quantization-aware fine-tuning.

These contributions address the key challenge in current mixed-precision quantization (MPQ) methods: how to find the optimal bit-width for each layer while maintaining high task performance under resource constraints.
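To make the idea of constraint-satisfying bit-width allocation concrete, here is a minimal Python sketch. It is not QBitOpt's solver: the summary above describes a convex program solved with off-the-shelf software, whereas this sketch swaps in a simple greedy heuristic. The function name `allocate_bitwidths`, the candidate bit-widths (2, 4, 8), and the noise proxy (error shrinking by a factor of 4 per added bit) are all assumptions made for illustration. What it does share with the description above is that the budget is checked before every upgrade, so the result satisfies the average bit-width constraint by construction.

```python
from typing import List, Sequence


def allocate_bitwidths(
    sensitivities: Sequence[float],
    sizes: Sequence[int],
    avg_bits_budget: float,
    choices: Sequence[int] = (2, 4, 8),
) -> List[int]:
    """Greedy allocation under a size-weighted average bit-width budget.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    choices = sorted(choices)
    n = len(sensitivities)
    level = [0] * n  # index into `choices`; all layers start at the lowest
    budget = avg_bits_budget * sum(sizes)  # total number of bits allowed
    used = sum(size * choices[0] for size in sizes)
    if used > budget:
        raise ValueError("budget is below the smallest available bit-width")

    def noise(i: int, bits: int) -> float:
        # Assumed proxy: quantization error power drops 4x per extra bit,
        # scaled by the layer's sensitivity.
        return sensitivities[i] * 4.0 ** (-bits)

    while True:
        best, best_ratio = None, 0.0
        for i in range(n):
            if level[i] + 1 >= len(choices):
                continue  # layer already at the highest bit-width
            cur, nxt = choices[level[i]], choices[level[i] + 1]
            cost = sizes[i] * (nxt - cur)
            if used + cost > budget:
                continue  # this upgrade would violate the constraint
            ratio = (noise(i, cur) - noise(i, nxt)) / cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # no affordable upgrade remains
        cur, nxt = choices[level[best]], choices[level[best] + 1]
        used += sizes[best] * (nxt - cur)
        level[best] += 1
    return [choices[l] for l in level]


# Three layers: the large, insensitive layer stays at 2 bits.
print(allocate_bitwidths([10.0, 1.0, 0.1], [100, 100, 400], 4.0))  # [8, 8, 2]
```

In this example the size-weighted average is exactly 4 bits (2400 bits over 600 parameters). In the setting described above, such an allocation step would be re-run periodically during QAT with refreshed sensitivities rather than once after training.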

QBitOpt: Fast and Accurate Bitwidth Reallocation during Training

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks

AdaQAT: Adaptive Bit-Width Quantization-Aware Training

Efficient Bitwidth Search for Practical Mixed Precision Neural Network

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

PTMQ: Post-training Multi-Bit Quantization of Neural Networks

Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

Mixed-Precision Neural Network Quantization Via Learned Layer-Wise Importance

BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch

OMPQ: Orthogonal Mixed Precision Quantization

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search

BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection.

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart

Optimization-based Post-training Quantization with Bit-split and Stitching

Mixed-precision Deep Neural Network Quantization With Multiple Compression Rates

Mixed-precision quantized neural networks with progressively decreasing bitwidth