AsyFunc: A High-Performance and Resource-Efficient Serverless Inference System via Asymmetric Functions

Qiangyu Pei, Yongjie Yuan, Haichuan Hu, Qiong Chen, Fangming Liu
DOI: https://doi.org/10.1145/3620678.3624664
2023-01-01
Abstract: Recent advances in deep learning (DL) have spawned various intelligent cloud services built on well-trained DL models. Nevertheless, it is nontrivial to maintain the desired end-to-end latency under bursty workloads, which makes it challenging to build inference services that are both high-performance and resource-efficient. To handle burstiness, some inference services have migrated to the serverless paradigm for its rapid elasticity. However, they neglect the impact of the time-consuming and resource-hungry model-loading process when scaling out function instances, leading to considerable resource inefficiency when maintaining high performance under burstiness. To address this issue, we open up the black box of DL models and find an interesting phenomenon: the sensitivity of each layer to computing resources is mostly anti-correlated with its memory usage. Motivated by this, we propose asymmetric functions, where the original Body Function still loads a complete model to meet stable demands, while the proposed lightweight Shadow Function loads only a portion of resource-sensitive layers to handle sudden demands. By parallelizing computations on resource-sensitive layers, the surging demand can be satisfied, even though the rest of the layers are executed serially in Body Functions only. We implement asymmetric functions on top of Knative and build a high-performance and resource-efficient inference serving system named AsyFunc, with a new auto-scaling and scheduling engine. Evaluation results driven by production traces show that, compared with the state of the art, AsyFunc saves computing and memory resources by up to 31.1% and 32.5%, respectively, while providing consistent performance guarantees under burstiness.
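To make the Body/Shadow split concrete, the following is a minimal illustrative sketch, not the selection method used by AsyFunc. It assumes each layer has been profiled for its memory footprint and a compute-sensitivity score, and it greedily picks the layers with the highest sensitivity per MB under a memory budget. The names `LayerProfile` and `pick_shadow_layers`, the greedy rule, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    name: str
    memory_mb: float    # memory needed to load the layer (assumed > 0)
    sensitivity: float  # hypothetical score: benefit from extra compute


def pick_shadow_layers(layers, memory_budget_mb):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    take layers in decreasing sensitivity-per-MB order while they fit."""
    ranked = sorted(
        layers, key=lambda l: l.sensitivity / l.memory_mb, reverse=True
    )
    chosen, used = [], 0.0
    for layer in ranked:
        if used + layer.memory_mb <= memory_budget_mb:
            chosen.append(layer.name)
            used += layer.memory_mb
    # Report the chosen layers in their original model order.
    order = {l.name: i for i, l in enumerate(layers)}
    chosen.sort(key=order.__getitem__)
    return chosen, used


# Made-up profile: compute-sensitive layers are small, memory-heavy ones are not.
MODEL = [
    LayerProfile("conv1", 2.0, 0.9),
    LayerProfile("conv2", 4.0, 0.8),
    LayerProfile("fc1", 400.0, 0.2),
    LayerProfile("fc2", 64.0, 0.1),
]

shadow_layers, shadow_mb = pick_shadow_layers(MODEL, memory_budget_mb=16.0)
body_mb = sum(l.memory_mb for l in MODEL)  # the Body Function loads everything
print(shadow_layers, shadow_mb, body_mb)
```

In this toy profile the Shadow Function would load only the two convolution layers, a small fraction of the memory the Body Function needs, which mirrors the anti-correlation described in the abstract.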