Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

Georg Rutishauser, Francesco Conti, Luca Benini
2023-07-06
Abstract: Mixed-precision quantization, where a deep neural network's layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to 28.6% reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline, even on systems with no hardware support for sub-byte arithmetic at negligible accuracy drop. Furthermore, we show the superiority of our approach with respect to differentiable search targeting reduced binary operation counts as a proxy for latency.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to minimize the inference latency of mixed-precision quantized neural networks deployed on edge devices while preserving statistical accuracy. It proposes a hybrid search method that combines a hardware-agnostic differentiable search algorithm with a subsequent hardware-aware heuristic optimization, yielding a low-latency mixed-precision configuration for a specific hardware target. The goal is to improve on the trade-offs between model size, latency, and statistical accuracy that homogeneous-bit-width quantization can achieve. The method is evaluated on MobileNetV1 and MobileNetV2, deployed on a family of multi-core RISC-V microcontroller platforms, where it reduces end-to-end latency by up to 28.6% compared to an 8-bit model at a negligible accuracy drop from the full-precision baseline on ImageNet.
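To make the second, hardware-aware stage more concrete, the following is a minimal sketch of what such a heuristic could look like. It is not taken from the paper: the rule used here (promote a layer to a higher bit-width whenever the target's measured latency at that bit-width is no worse) is an assumption, and the function names `promote_free_bits` and `total_latency`, as well as the per-layer latency table and its numbers, are hypothetical.

```python
from typing import Dict, List


def promote_free_bits(config: List[int], latency: List[Dict[int, float]]) -> List[int]:
    """Assumed heuristic: per layer, move to a higher bit-width whenever
    doing so is no slower on the target than the current choice."""
    result = []
    for bits, table in zip(config, latency):
        best = bits
        # Visit candidate bit-widths in ascending order.
        for cand in sorted(table):
            if cand > bits and table[cand] <= table[best]:
                best = cand
        result.append(best)
    return result


def total_latency(config: List[int], latency: List[Dict[int, float]]) -> float:
    """Sum the per-layer latencies of a configuration."""
    return sum(table[b] for b, table in zip(config, latency))


# Hypothetical per-layer latency tables (bit-width -> cycles); made-up numbers.
latency = [
    {2: 120.0, 4: 100.0, 8: 90.0},  # sub-byte layers are slower on this target
    {2: 50.0, 4: 60.0, 8: 80.0},    # lower precision is genuinely faster
    {2: 70.0, 4: 70.0, 8: 95.0},    # 4-bit costs the same as 2-bit
]

searched = [2, 2, 2]  # stand-in for the output of the differentiable search
tuned = promote_free_bits(searched, latency)

print(tuned)                            # [8, 2, 4]
print(total_latency(searched, latency)) # 240.0
print(total_latency(tuned, latency))    # 210.0
```

In this toy example the first layer is promoted to 8 bits because the hypothetical target executes it faster at that precision, which illustrates why a hardware-agnostic proxy such as binary operation count can disagree with measured latency.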