Online Resource Management in Thermal and Energy Constrained Heterogeneous High Performance Computing

Mark A. Oxley, Sudeep Pasricha, Anthony A. Maciejewski, Howard Jay Siegel, Patrick J. Burns
DOI: https://doi.org/10.1109/dasc-picom-datacom-cyberscitec.2016.111
2016-08-01
Abstract: Operators of high-performance computing (HPC) facilities face trade-offs among the operating temperature of the facility, the reliability of compute nodes, energy costs, and computing performance. Intelligent management of an HPC facility typically takes a proactive approach, predicting the thermal implications of allocating tasks to different cores around the facility. This allows the facility to operate at a hotter computer room air conditioning (CRAC) temperature while avoiding hotspots. However, such an approach can be time-consuming, because complicated air flow models must be calculated for every mapping decision. We propose a framework in which offline analysis assists an online resource manager by predicting the thermal implications of mapping a given workload. The goal is to maximize the reward earned from completing tasks by their individual deadlines throughout the day, while adhering to a daily energy budget and temperature threshold constraints. We show that our proposed techniques can earn significantly greater reward than traditional load balancing and thermal management schemes.
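The kind of online mapping decision described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the `Task` fields, the `pick_core` function, and the `predicted_temp` lookup table (standing in for the results of the offline thermal analysis) are all hypothetical names introduced here, and the greedy "least energy among feasible cores" rule is an assumed placeholder heuristic. It only shows how a deadline, a remaining energy budget, and a temperature threshold could each rule out a candidate core without running an air flow model at decision time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # All fields are illustrative assumptions, not taken from the paper.
    reward: float     # reward earned if the task finishes by its deadline
    deadline: float   # absolute deadline (seconds)
    exec_time: dict   # core id -> execution time on that core (seconds)
    power: dict       # core id -> average power on that core (watts)


def pick_core(task, now, core_free_at, energy_left, predicted_temp, temp_limit):
    """Return (core, energy) for the feasible core using the least energy.

    predicted_temp is a hypothetical table, assumed to be filled in offline,
    giving the predicted temperature if the task is mapped to each core.
    Returns (None, None) when no core satisfies all three constraints.
    """
    best_core, best_energy = None, None
    for core, free_at in core_free_at.items():
        finish = max(now, free_at) + task.exec_time[core]
        energy = task.exec_time[core] * task.power[core]  # joules
        if finish > task.deadline:
            continue  # task would miss its deadline and earn no reward
        if energy > energy_left:
            continue  # would exceed the remaining daily energy budget
        if predicted_temp[core] > temp_limit:
            continue  # would violate the temperature threshold
        if best_energy is None or energy < best_energy:
            best_core, best_energy = core, energy
    return best_core, best_energy


# Example with two hypothetical cores: "b" is faster but predicted too hot.
task = Task(reward=1.0, deadline=20.0,
            exec_time={"a": 10.0, "b": 5.0},
            power={"a": 100.0, "b": 150.0})
choice = pick_core(task, now=0.0,
                   core_free_at={"a": 0.0, "b": 0.0},
                   energy_left=5000.0,
                   predicted_temp={"a": 24.0, "b": 30.0},
                   temp_limit=27.0)
print(choice)
```

In this sketch the thermal check is a table lookup, which reflects the abstract's idea of moving expensive thermal prediction offline; how the paper actually builds and uses its offline analysis is not specified in the abstract.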