PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences

Shulai Zhang, Weihao Cui, Quan Chen, Zhengnian Zhang, Yue Guan, Jingwen Leng, Chao Li, Minyi Guo
DOI: https://doi.org/10.1145/3524059.3532366
2022-01-01
Abstract: In emerging DNN serving systems, queries are usually batched to fully leverage hardware resources, and all the queries in a batch run through the complete model and return at the same time. We find that some queries need to pass through only a portion of the DNN model to attain sufficient precision. Such queries can have shorter latencies if they return early, partway through the model. We therefore propose precision-aware multi-exit inference serving, PAME, which lets these queries exit early. PAME provides a holistic scheme for building a multi-exit DNN model, together with a corresponding system-level design of the inference engine. We evaluate PAME on representative CV and NLP benchmarks and find that it adapts to various DNN tasks and service loads. Experimental results show that PAME reduces average latency by 39.9% without increasing tail latency, while retaining on average 99.68% of the precision of the original single-exit DNN models.
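The early-return idea in the abstract can be illustrated with a minimal sketch. The abstract does not state PAME's exit criterion or describe its engine internals, so the sketch assumes a simple confidence threshold, a common choice in early-exit networks. All names here (`multi_exit_batch`, `toy_stage`, `toy_head`, `threshold`) are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a stage transforms a feature vector; an exit head
# maps features to (predicted label, confidence in [0, 1]).
Stage = Callable[[List[float]], List[float]]
ExitHead = Callable[[List[float]], Tuple[int, float]]


def multi_exit_batch(
    batch: Sequence[List[float]],
    stages: Sequence[Stage],
    heads: Sequence[ExitHead],
    threshold: float,
) -> Dict[int, Tuple[int, int]]:
    """Run a batch through the stages; return {query index: (label, exit stage)}.

    Assumption: a query exits at the first stage whose head reports a
    confidence >= threshold. The final stage always returns a result.
    """
    if len(stages) != len(heads):
        raise ValueError("need exactly one exit head per stage")
    active = {i: list(x) for i, x in enumerate(batch)}  # queries still in flight
    results: Dict[int, Tuple[int, int]] = {}
    last = len(stages) - 1
    for k, (stage, head) in enumerate(zip(stages, heads)):
        if not active:
            break  # every query has already returned
        for i in list(active):
            active[i] = stage(active[i])
            label, confidence = head(active[i])
            if confidence >= threshold or k == last:
                results[i] = (label, k)  # this query returns here
                del active[i]            # and leaves the batch
    return results


# Toy stand-ins for real layers (illustration only).
def toy_stage(x: List[float]) -> List[float]:
    return [2.0 * v for v in x]


def toy_head(x: List[float]) -> Tuple[int, float]:
    magnitude = max(abs(v) for v in x)
    label = 1 if sum(x) >= 0 else 0
    return label, magnitude / (1.0 + magnitude)


results = multi_exit_batch(
    batch=[[4.0], [1.5], [-0.1]],
    stages=[toy_stage] * 3,
    heads=[toy_head] * 3,
    threshold=0.8,
)
```

In this toy run the first query is confident after one stage, the second after two, and the third runs the full model, so only the hardest query pays the full latency. A real engine would also have to decide how to keep hardware utilization high as the batch shrinks, which this sketch does not attempt.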