Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities
Ahmad Tarraf, Martin Schreiber, Alberto Cascajo, Jean-Baptiste Besnard, Marc-André Vef, Dominik Huber, Sonja Happ, André Brinkmann, David E. Singh, Hans-Christian Hoppe, Alberto Miranda, Antonio J. Peña, Rui Machado, Marta Garcia-Gasulla, Martin Schulz, Paul Carpenter, Simon Pickartz, Tiberiu Rotaru, Sergio Iserte, Victor Lopez, Jorge Ejarque, Heena Sirwani, Jesus Carretero, Felix Wolf
DOI: https://doi.org/10.1109/tpds.2024.3406764
IF: 5.3
2024-07-20
IEEE Transactions on Parallel and Distributed Systems
Abstract: With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for more than two decades. This article presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.
computer science, theory & methods; engineering, electrical & electronic