Astraea: towards QoS-aware and resource-efficient multi-stage GPU services

Wei Zhang, Quan Chen, Kaihua Fu, Ningxin Zheng, Zhiyi Huang, Jingwen Leng, Minyi Guo
DOI: https://doi.org/10.1145/3503222.3507721
2022-02-28
Abstract: Multi-stage user-facing applications on GPUs are widely used nowadays, and are often implemented as microservices. Prior work cannot ensure the QoS of GPU-based microservices because of their different communication patterns and shared-resource contention. We propose Astraea to manage GPU microservices with these factors in mind. In Astraea, a microservice deployment policy maximizes the supported peak service load while ensuring the required QoS. To adaptively switch the communication method between microservices according to the deployment, we propose an auto-scaling GPU communication framework. The framework scales automatically based on the hardware topology currently in use and on microservice location, and adopts global memory-based techniques to reduce intra-GPU communication. Compared with state-of-the-art solutions, Astraea increases the supported peak load by up to 82.3% while meeting the desired 99%-ile latency target.