Communication-aware Quantization for Deep Learning Inference Parallelization on Chiplet-based Accelerators

Kaiwei Zou, Songyun Qu, Wen Li, Ying Wang, Huawei Li, Yongpan Liu
DOI: https://doi.org/10.1109/icpads60453.2023.00165
2023-01-01
Abstract: Neural network accelerators are increasingly scaling from single-core designs to chiplet-based multichip architectures, as the growth of neural network depth and complexity calls for greater computation and memory capabilities. However, the extensive inter-chip communication of chiplet-based accelerators may bottleneck the parallelism of deep learning inference, which is undesirable for many real-time applications and energy-efficient devices. Although novel schemes are needed to alleviate this problem, related work is scarce. In this work, we present CampQ, a fine-grained communication-aware mixed-precision quantization method that accelerates inference parallelization by reducing the major inter-chiplet communication overhead. By leveraging the AutoML technique, CampQ assigns different bit-widths to activation groups according to their transmission distances in the on-package network. The experimental results show 1.4×-2.6× performance benefits and 29%-60% energy reduction over the 16-bit models for various neural networks and parallelism approaches.
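To make the idea of distance-dependent bit-widths concrete, the following is a minimal sketch and not the method from the paper. It swaps the AutoML search for a fixed threshold heuristic, and it assumes a simple traffic model in which cost is elements × bits × hops. The function names, thresholds, group sizes, and hop counts are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: a fixed distance-based heuristic, NOT CampQ's AutoML search.
# All names, thresholds, and numbers below are assumptions made for this sketch.

def assign_bit_widths(hop_counts, thresholds=((1, 16), (3, 8)), far_bits=4):
    """Map each activation group's hop count to a bit-width.

    Groups that travel farther across the on-package network get fewer bits.
    `thresholds` is a sequence of (max_hops, bits) pairs, checked in order;
    groups beyond the last threshold fall back to `far_bits`.
    """
    widths = []
    for hops in hop_counts:
        for max_hops, bits in thresholds:
            if hops <= max_hops:
                widths.append(bits)
                break
        else:
            widths.append(far_bits)
    return widths


def communication_cost(sizes, hop_counts, bit_widths):
    """Total traffic in bit-hops: elements * bits * hops, summed over groups."""
    return sum(n * b * h for n, b, h in zip(sizes, bit_widths, hop_counts))


sizes = [1024, 1024, 1024]  # elements per activation group (made-up values)
hops = [1, 2, 4]            # hops each group travels (made-up values)

mixed = assign_bit_widths(hops)
baseline_cost = communication_cost(sizes, hops, [16] * len(hops))
mixed_cost = communication_cost(sizes, hops, mixed)
```

In this toy setup the farthest group carries the most weight in the cost, so lowering its precision removes the largest share of traffic. A real search would also have to weigh the accuracy loss of each bit-width choice, which this sketch ignores.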