Abstract: Conventional model quantization methods apply a single fixed quantization scheme to all data samples, ignoring the inherent differences in "recognition difficulty" between samples. We propose to process different data samples with different quantization schemes, achieving data-dependent dynamic inference at a fine-grained, layer-wise level. However, enabling such adaptive inference with changeable layer-wise quantization schemes is challenging: the number of combinations of bit-widths and layers grows exponentially, which makes it extremely difficult to train a single model over such a vast search space and to use it in practice. To solve this problem, we present the Arbitrary Bit-width Network (ABN), in which the bit-widths of a single deep network can change at runtime for different data samples, with layer-wise granularity. Specifically, we first build a weight-shared, layer-wise quantizable "super-network" in which each layer can be assigned multiple bit-widths and thus quantized differently on demand. The super-network provides a very large number of combinations of bit-widths and layers, each of which can be used during inference without retraining or storing many separate models. Second, based on the well-trained super-network, the runtime bit-width selection for each layer is modeled as a Markov Decision Process (MDP) and solved by a corresponding adaptive inference strategy. Experiments show that the super-network can be built without accuracy degradation, and that the bit-width allocation of each layer can be adjusted on the fly to handle various inputs. On ImageNet classification, we achieve a 1.1% top-1 accuracy improvement while saving 36.2% of BitOps.
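To make the scale of the search space and the BitOps trade-off concrete, the following minimal sketch quantizes one shared set of weights at several bit-widths and counts the per-layer configurations of a toy network. It is an illustration under stated assumptions, not the ABN implementation: the candidate bit-widths, the toy MAC counts, the symmetric uniform quantizer, and the BitOps formula (MACs times weight bits times activation bits, a commonly used convention) are all assumptions, and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical settings, chosen only for illustration.
CANDIDATE_BITS = (2, 4, 8)           # assumed per-layer bit-width choices
MACS_PER_LAYER = [1000, 2000, 500]   # assumed toy multiply-accumulate counts


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization (an assumed scheme); requires bits >= 2."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def network_bitops(macs_per_layer, bit_config):
    """Assumed BitOps convention: MACs * weight bits * activation bits,
    using the same bit-width for weights and activations in each layer."""
    return sum(m * b * b for m, b in zip(macs_per_layer, bit_config))


# Every layer picks one bit-width, so the number of configurations is
# len(CANDIDATE_BITS) ** number_of_layers, i.e. exponential in depth.
all_configs = list(product(CANDIDATE_BITS, repeat=len(MACS_PER_LAYER)))

# One shared full-precision weight vector, quantized on demand per bit-width.
shared_weights = [i / 50 - 1 for i in range(101)]
errors = {
    b: mean_abs_error(shared_weights, quantize_uniform(shared_weights, b))
    for b in CANDIDATE_BITS
}

# Compare a uniform 8-bit configuration with one mixed configuration.
full = network_bitops(MACS_PER_LAYER, (8, 8, 8))
mixed = network_bitops(MACS_PER_LAYER, (4, 2, 8))
saving = 1 - mixed / full
```

Even this three-layer toy has 27 configurations; at realistic depths the count grows far too large to train or store one model per configuration, which is the motivation for sharing weights in a single super-network and selecting bit-widths per input at runtime.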

General Bitwidth Assignment for Efficient Deep Convolutional Neural Network Quantization

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks

Instance-Aware Dynamic Neural Network Quantization

Efficient Bitwidth Search for Practical Mixed Precision Neural Network

Space Efficient Quantization for Deep Convolutional Neural Networks

Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks

Adaptive Layerwise Quantization for Deep Neural Network Compression

Distribution-aware Adaptive Multi-bit Quantization

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Effective Training of Convolutional Neural Networks with Low-bitwidth Weights and Activations

Quantization Networks

Class-based Quantization for Neural Networks

Weighted-Entropy-Based Quantization for Deep Neural Networks

Neural Network Activation Quantization with Bitwise Information Bottlenecks

Residual Quantization for Low Bit-Width Neural Networks

Fixed Point Quantization of Deep Convolutional Networks

Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Weight Normalization based Quantization for Deep Neural Network Compression