Software Resource Disaggregation for HPC with Serverless Computing

Marcin Copik, Marcin Chrapek, Larissa Schmid, Alexandru Calotoiu, Torsten Hoefler
2024-07-26
Abstract: Aggregated HPC resources have rigid allocation systems and programming models which struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this paper, we improve the utilization of supercomputers by employing the new cloud paradigm of serverless computing. We show how serverless functions provide fine-grained access to the resources of batch-managed cluster nodes. We present an HPC-oriented Function-as-a-Service (FaaS) that satisfies the requirements of high-performance applications. We demonstrate a software resource disaggregation approach where placing functions on unallocated and underutilized nodes allows idle cores and accelerators to be utilized while retaining near-native performance.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the underutilization of resources in High-Performance Computing (HPC) systems and proposes software-level solutions. It focuses on three core issues:

1. **Underutilization of Resources**: Modern HPC systems commonly leave memory and compute resources underutilized. Research indicates that even in top-tier supercomputing systems, a significant amount of memory remains unused, and compute nodes can sit idle temporarily.
2. **Limitations of Static Resource Allocation Mechanisms**: Traditional HPC resource allocation relies on static batch allocation, which adapts poorly to varying and diverse workloads and leads to inefficient resource utilization.
3. **Cost and Complexity of Hardware Solutions**: Although hardware-level resource disaggregation can improve resource utilization, it usually requires expensive hardware investments and complex system reconfiguration.

To address these issues, the paper proposes the following solutions:

- **Software Resource Disaggregation**: By adopting serverless computing, particularly the Function-as-a-Service (FaaS) model, the paper achieves software-level disaggregation of resources in HPC systems. This approach gives users fine-grained access to underutilized computational resources without significant modifications to existing hardware or operating systems.
- **HPC-Specialized FaaS Platform**: The paper introduces rFaaS, a FaaS platform designed specifically for HPC environments, which can improve resource utilization without changing existing hardware configurations. The rFaaS platform supports deploying functions to underutilized nodes, making use of idle CPU cores, memory, and GPU resources.
- **Improved Resource Allocation Strategies**: To reduce resource contention between different applications, the paper also proposes a new resource co-location strategy. This strategy determines which tasks are suitable to be co-located on the same node by analyzing historical data and performance models.

In summary, the paper aims to improve resource utilization in HPC systems through software resource disaggregation, address the limitations of traditional resource allocation mechanisms, and avoid the cost and complexity of hardware solutions.
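
To make the idea of placing functions on underutilized nodes concrete, the following is a minimal sketch of a first-fit placement over idle cores and memory. It is not the rFaaS scheduler or its API: the `Node` and `FunctionRequest` types, the `place_functions` helper, and all node names and sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical view of a cluster node's currently idle resources."""
    name: str
    idle_cores: int
    idle_memory_gb: float


@dataclass
class FunctionRequest:
    """Hypothetical resource demand of one serverless function."""
    name: str
    cores: int
    memory_gb: float


def place_functions(
    nodes: List[Node], requests: List[FunctionRequest]
) -> Dict[str, Optional[str]]:
    """First-fit placement: map each function to the first node with
    enough idle cores and memory, or to None if no node fits."""
    placement: Dict[str, Optional[str]] = {}
    for req in requests:
        placement[req.name] = None
        for node in nodes:
            fits = (node.idle_cores >= req.cores
                    and node.idle_memory_gb >= req.memory_gb)
            if fits:
                # Reserve the idle resources so later requests see the update.
                node.idle_cores -= req.cores
                node.idle_memory_gb -= req.memory_gb
                placement[req.name] = node.name
                break
    return placement


# Illustrative values only (assumed, not taken from the paper).
nodes = [Node("node-a", 4, 16.0), Node("node-b", 32, 128.0)]
requests = [
    FunctionRequest("solver", 8, 32.0),
    FunctionRequest("io-helper", 2, 4.0),
    FunctionRequest("too-large", 64, 8.0),
]
placement = place_functions(nodes, requests)
print(placement)
```

A real system would also need to account for contention with the batch job already running on the node, which this sketch ignores; it only shows how fine-grained requests can fill resource gaps that whole-node batch allocation leaves unused.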