Improving Cluster Resource Efficiency with Oversubscription

Jie Chen, Chun Cao, Ying Zhang, Xiaoxing Ma, Haiwei Zhou, Chengwei Yang
DOI: https://doi.org/10.1109/compsac.2018.00027
2018-01-01
Abstract: Many studies on resource scheduling have been proposed to improve the efficiency of computing clusters. However, because users usually overestimate the resource requirements of their jobs, and most schedulers ignore the dynamic variation of jobs' runtime resource usage, the utilization of real-world clusters is significantly limited. In this paper, we argue that resource oversubscription, which allocates more resources than the physical capacity, is a necessary complement to existing systems. To alleviate resource contention, we augment oversubscription with lightweight prediction and dynamic CPU throttling. We implemented our approach, called Datom, as an extension module of the Apache Mesos cluster manager. We evaluated Datom with real-world video transcoding workloads and with simulations based on the Google cluster trace. The results show that, compared to the original Mesos, Datom increased CPU utilization, memory utilization, and overall task throughput by up to 22%, 23%, and 20%, respectively, and shortened job completion time by up to 20%, at the expense of a moderate penalty on job execution.
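
The general idea of oversubscription with prediction and throttling can be illustrated with a minimal Python sketch. This is not Datom's actual algorithm, which the abstract does not specify. The predictor (recent peak usage times a safety margin), the 90% contention threshold, and all names (`Task`, `predict_usage`, `oversubscribable_cpus`, `throttle_factor`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    requested_cpus: float      # what the user asked for (often an overestimate)
    recent_usage: List[float]  # recent CPU usage samples, in cores


def predict_usage(samples: List[float], headroom: float = 1.2) -> float:
    """Hypothetical lightweight predictor: recent peak times a safety margin."""
    if not samples:
        return 0.0
    return max(samples) * headroom


def oversubscribable_cpus(tasks: List[Task], capacity: float) -> float:
    """CPUs that are allocated but predicted to stay idle, so they could be re-offered."""
    allocated = sum(t.requested_cpus for t in tasks)
    # A task cannot be predicted to use more than it was allocated.
    predicted = sum(min(t.requested_cpus, predict_usage(t.recent_usage)) for t in tasks)
    slack = allocated - predicted
    # Never offer more than the node is predicted to have free.
    return max(0.0, min(slack, capacity - predicted))


def throttle_factor(measured_usage: float, capacity: float, threshold: float = 0.9) -> float:
    """Fraction of CPU quota to leave to oversubscribed tasks under contention."""
    limit = threshold * capacity
    if measured_usage <= limit:
        return 1.0  # no contention, no throttling
    return max(0.0, limit / measured_usage)


tasks = [Task("a", 4.0, [1.0, 1.5]), Task("b", 4.0, [2.0])]
extra = oversubscribable_cpus(tasks, capacity=8.0)
print(f"oversubscribable CPUs: {extra:.1f}")
print(f"throttle factor at full load: {throttle_factor(8.0, 8.0):.2f}")
```

In this sketch the node is fully allocated by request, yet the predicted usage leaves room that a scheduler could offer to additional tasks. The throttle factor then scales those tasks back once measured usage crosses the assumed threshold.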