Adaptive QoS-aware Microservice Deployment with Excessive Loads Via Intra- and Inter-Datacenter Scheduling
Jiuchen Shi,Kaihua Fu,Jiawen Wang,Quan Chen,Deze Zeng,Minyi Guo
DOI: https://doi.org/10.1109/tpds.2024.3425931
IF: 5.3
2024-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: User-facing applications often experience excessive loads and are shifting towards the microservice architecture. To fully utilize heterogeneous resources, current datacenters have adopted the disaggregated storage and compute architecture, where the storage and compute clusters are suitable to deploy the stateful and stateless microservices, respectively. Moreover, when the local datacenter has insufficient resources to host excessive loads, a reasonable solution is moving some microservices to remote datacenters. However, it is nontrivial to decide the appropriate microservice deployment inside the local datacenter and identify the appropriate migration decision to remote datacenters, as microservices show different characteristics, and the local datacenter shows different resource contention situations. We therefore propose ELIS, an intra- and inter-datacenter scheduling system that ensures the Quality-of-Service (QoS) of the microservice application, while minimizing the network bandwidth usage and computational resource usage. ELIS comprises a resource manager, a cross-cluster microservice deployer, and a reward-based microservice migrator. The resource manager allocates near-optimal resources for microservices while ensuring QoS. The microservice deployer deploys the microservices between the storage and compute clusters in the local datacenter, to minimize the network bandwidth usage while satisfying the microservice resource demand. The microservice migrator migrates some microservices to remote datacenters when local resources cannot afford the excessive loads. Experimental results show that ELIS ensures the QoS of user-facing applications. Meanwhile, it reduces the public network bandwidth usage, the remote computational resource usage, and the local network bandwidth usage by 49.6%, 48.5%, and 60.7% on average, respectively.
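The cross-cluster deployment problem described in the abstract (place microservices on the storage or compute cluster so that cross-cluster bandwidth is minimized while resource demand is satisfied) can be illustrated with a toy sketch. This is not ELIS's algorithm, which the abstract does not specify; it is an exhaustive search over a tiny, made-up instance. The function name `place`, the service names, the CPU demands, the capacities, and the traffic figures are all hypothetical.

```python
from itertools import product

def place(services, traffic, capacity, pinned):
    """Toy exhaustive search (an illustration, not the ELIS deployer).

    services: name -> CPU demand (hypothetical units)
    traffic:  (a, b) -> bandwidth exchanged between services a and b
    capacity: cluster -> CPU capacity
    pinned:   name -> cluster the service must stay on
              (e.g. a stateful service kept on the storage cluster)
    Returns (placement, cross_cluster_traffic), or None if nothing fits.
    """
    names = sorted(services)
    clusters = sorted(capacity)
    best = None
    for choice in product(clusters, repeat=len(names)):
        placement = dict(zip(names, choice))
        # Respect pinning constraints.
        if any(placement[s] != c for s, c in pinned.items()):
            continue
        # Respect per-cluster capacity.
        used = {c: 0 for c in clusters}
        for s, c in placement.items():
            used[c] += services[s]
        if any(used[c] > capacity[c] for c in clusters):
            continue
        # Only traffic between services on different clusters costs bandwidth.
        cost = sum(bw for (a, b), bw in traffic.items()
                   if placement[a] != placement[b])
        if best is None or cost < best[1]:
            best = (placement, cost)
    return best

# Hypothetical instance: one stateful service (db) and three stateless ones.
services = {"db": 2, "cache": 2, "api": 3, "web": 2}
capacity = {"storage": 4, "compute": 6}
pinned = {"db": "storage"}
traffic = {("db", "cache"): 50, ("cache", "api"): 30,
           ("api", "web"): 10, ("db", "api"): 5}

placement, cost = place(services, traffic, capacity, pinned)
print(placement, cost)
```

In this instance the search co-locates `cache` with `db` on the storage cluster, because that pair exchanges the most traffic, and leaves `api` and `web` on the compute cluster. Exhaustive search grows exponentially with the number of services, so it only serves to make the objective and constraints concrete.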