Sinan: Data Driven Resource Management for Cloud Microservices

Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, Ed Suh, Christina Delimitrou
DOI: https://doi.org/10.48550/arXiv.2112.06254
2021-12-12
Abstract: Cloud applications are increasingly shifting to interactive and loosely-coupled microservices. Despite their advantages, microservices complicate resource management, due to inter-tier dependencies. We present Sinan, a cluster manager for interactive microservices that leverages easily-obtainable tracing data instead of empirical decisions, to infer the impact of a resource allocation on end-to-end performance, and allocate appropriate resources to each tier. In a preliminary evaluation of Sinan with an end-to-end social network built with microservices, we show that Sinan's data-driven approach allows the service to always meet its QoS without sacrificing resource efficiency.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the difficulty of managing resources for microservice-based applications in the cloud. It identifies three challenges:

1. **Large action space**: Because application behavior changes frequently, resource management decisions must be made online, so the resource manager needs a practical way to search the space of all possible resource allocations for each microservice. Given \(N\) microservice tiers and a pool of \(C\) (\(C \geq N\)) homogeneous physical cores, each with \(F\) frequency levels, the size of the action space is \(\binom{C - 1}{N - 1}\cdot F^N\): \(\binom{C - 1}{N - 1}\) ways to split the cores so that every tier receives at least one, multiplied by a choice among \(F\) frequency levels for each of the \(N\) tiers. For example, in a cluster with 150 cores and 10 frequency steps per core, the resource allocation space for the social network application has size \(7.78\times 10^{55}\). Evaluating every action under different loads would require a prohibitive amount of time and computing resources, so resource scheduling needs efficient action-space pruning and statistical tools that generalize well.

2. **Queuing effect of latency**: In a queuing system with processing throughput \(T_o\) and a latency Quality of Service (QoS) target \(Q\), \(T_o\) is a non-decreasing function of the allocated resources \(R\). To meet QoS and keep the system stable while using the least amount of \(R\), \(T_o\) should equal or slightly exceed the input load \(T_i\). Even if \(R\) is reduced so that \(T_o < T_i\), QoS is not violated immediately, because the queue takes time to build up. Conversely, once QoS is violated, the accumulated queue takes a long time to drain even if resources are increased immediately.
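The asymmetry between building and draining a queue can be illustrated with a minimal sketch. It assumes a simple discrete-time fluid queue with a constant input load; the function name, rates, and interval counts are hypothetical values chosen for illustration and are not taken from the paper or from Sinan's models.

```python
def simulate_backlog(arrival_rate, capacities):
    """Return the backlog after each interval of a simple fluid queue.

    arrival_rate: requests arriving per interval (T_i), held constant.
    capacities:   requests that can be processed per interval (T_o),
                  one value per interval, reflecting allocated resources.
    """
    backlog = 0.0
    history = []
    for capacity in capacities:
        # Backlog grows by the shortfall or shrinks by the spare capacity,
        # and can never become negative.
        backlog = max(0.0, backlog + arrival_rate - capacity)
        history.append(backlog)
    return history


# Hypothetical numbers, for illustration only.
T_I = 100.0                 # input load, requests per interval
under = [80.0] * 5          # 5 intervals with T_o < T_i
restored = [105.0] * 30     # resources restored: T_o slightly above T_i

backlog = simulate_backlog(T_I, under + restored)
peak = max(backlog)
# Number of intervals after restoring resources until the queue is empty.
drain_intervals = next(
    i for i, b in enumerate(backlog[len(under):], start=1) if b == 0.0
)

print(f"peak backlog: {peak:.0f} requests")
print(f"intervals to drain after restoring resources: {drain_intervals}")
```

With these numbers, five under-provisioned intervals build a backlog of 100 requests, and draining it with only 5 requests per interval of spare capacity takes 20 intervals. Latency stays elevated throughout that recovery period, which is why the timing of an allocation change matters as much as its size.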
Multi-tier microservices form a complex queuing system, with queues both between and within microservices. This queuing effect means that the machine-learning model must evaluate the long-term impact of resource management actions and keep the resource manager from reclaiming resources too aggressively, which would introduce a long recovery period. To avoid QoS violations, the manager must increase resources ahead of time; once a queue has built up, a violation can no longer be avoided even if more resources are allocated afterwards.

3. **Inter-tier dependencies**: Dependent microservices are not perfect pipelines, so they can introduce back-pressure effects that are difficult to detect and prevent. Specific Remote Procedure Call (RPC) and data-storage API implementations can further exacerbate these dependencies. The resource scheduler therefore needs a global view and must be able to predict the impact of dependencies on end-to-end performance.

To address these challenges, the paper proposes Sinan, a machine-learning-based cluster manager that uses tracing data and a set of practical machine-learning techniques to infer the impact of a resource allocation on end-to-end performance and to allocate appropriate resources to each tier. Sinan takes a hybrid approach: a Convolutional Neural Network (CNN) predicts the end-to-end latency in the next decision interval, and a Boosted Trees model predicts the probability of a QoS violation further into the future. This improves resource efficiency while still meeting QoS.
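The action-space count from point 1 can be checked numerically. The sketch below assumes that every tier receives at least one core (the stars-and-bars term \(\binom{C - 1}{N - 1}\)) and that each tier independently picks one of \(F\) frequency levels (a factor of \(F\) per tier, \(F^N\) in total). The function name is hypothetical, and the tier count of 27 is an assumption: the text does not state \(N\), and 27 is the value that appears to reproduce the quoted order of \(10^{55}\).

```python
from math import comb


def action_space_size(cores, tiers, freq_levels):
    """Count allocations of `cores` and frequency levels across `tiers`.

    Assumes each tier gets at least one core and one of `freq_levels`
    frequency settings, chosen independently per tier.
    """
    core_splits = comb(cores - 1, tiers - 1)  # stars and bars
    freq_choices = freq_levels ** tiers       # one level per tier
    return core_splits * freq_choices


# 150 cores and 10 frequency steps come from the text; tiers=27 is assumed.
size = action_space_size(cores=150, tiers=27, freq_levels=10)
print(f"{size:.2e}")
```

Even at one evaluation per microsecond, a space of this size could not be enumerated in any practical time frame, which is the motivation for pruning the action space and relying on models that generalize across allocations.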