Characteristics of Co-Allocated Online Services and Batch Jobs in Internet Data Centers: A Case Study From Alibaba Cloud.

Congfeng Jiang, Guangjie Han, Jiangbin Lin, Gangyong Jia, Weisong Shi, Jian Wan
DOI: https://doi.org/10.1109/ACCESS.2019.2897898
IF: 3.9
2019-01-01
IEEE Access
Abstract: To reduce power and energy costs, large cloud providers now mix online services and batch jobs on the same cluster. Although co-allocating such jobs improves machine utilization, it challenges the data center scheduler and workload assignment in terms of quality of service, fault tolerance, and failure recovery, especially for latency-critical online services. In this paper, we explore various characteristics of co-allocated online services and batch jobs from a production cluster containing 1.3k servers in Alibaba Cloud. From the trace data, we find the following: 1) For batch jobs with multiple tasks and instances, 50.8% of failed tasks wait and are halted only after a very long time interval following their first and only instance failure. This wastes considerable time and resources, because the remaining instances keep running toward a successful termination that is no longer possible. 2) Online service jobs are clustered into 25 categories according to their requested CPU, memory, and disk resources. Such clustering can help the co-allocation of online service jobs with batch jobs. 3) Servers are clustered into seven groups by CPU utilization, memory utilization, and their correlations. Machines with a strong correlation between CPU and memory utilization provide an opportunity for job co-allocation and resource utilization estimation. 4) The MTBF (mean time between failures) of instances is in the interval [400, 800] seconds, while the average completion time of the 99th percentile is 1003 seconds. We also compare the cumulative distribution functions of jobs and servers and explain the differences and opportunities for workload assignment between them. The findings and insights presented in this paper can help the community and data center operators better understand workload characteristics and improve resource utilization and failure recovery design.