Abstract: Aggregated HPC resources have rigid allocation systems and programming models which struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this paper, we improve the utilization of supercomputers by employing the new cloud paradigm of serverless computing. We show how serverless functions provide fine-grained access to the resources of batch-managed cluster nodes. We present an HPC-oriented Function-as-a-Service (FaaS) that satisfies the requirements of high-performance applications. We demonstrate a software resource disaggregation approach where placing functions on unallocated and underutilized nodes allows idle cores and accelerators to be utilized while retaining near-native performance.

What problem does this paper attempt to address?

The paper addresses the underutilization of resources in High-Performance Computing (HPC) systems and proposes a software-based remedy. It focuses on three core issues:

1. **Underutilization of Resources**: Modern HPC systems commonly leave memory and computational resources underutilized. Even in top-tier supercomputing systems, a significant amount of memory goes unused, and compute nodes can sit idle temporarily.
2. **Limitations of Static Resource Allocation Mechanisms**: Traditional HPC resource allocation relies on static batch allocation, which adapts poorly to varying and diverse workloads and leads to inefficient resource utilization.
3. **Cost and Complexity of Hardware Solutions**: Hardware-level resource disaggregation can improve resource utilization, but it usually requires expensive hardware investments and complex system reconfigurations.

To address these issues, the paper proposes the following solutions:

- **Software Resource Disaggregation**: By introducing serverless computing, particularly the Function-as-a-Service (FaaS) model, the paper achieves software-level disaggregation of resources in HPC systems. This approach gives users fine-grained access to underutilized computational resources without significant modifications to existing hardware or operating systems.
- **HPC-Specialized FaaS Platform**: The paper introduces rFaaS, a FaaS platform designed specifically for HPC environments, which can improve resource utilization without changing existing hardware configurations. The rFaaS platform supports deploying functions to underutilized nodes, making use of idle CPU cores, memory, and GPU resources.
- **Improved Resource Allocation Strategies**: To reduce resource contention between different applications, the paper also proposes a new resource co-location strategy. This strategy determines which tasks are suitable to be co-located on the same node by analyzing historical data and performance models.

In summary, the paper aims to improve resource utilization in HPC systems through software resource disaggregation, address the limitations of traditional resource allocation mechanisms, and avoid the cost and complexity of hardware solutions.
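
The idea of placing functions on unallocated and underutilized nodes can be illustrated with a minimal sketch. It is not the paper's scheduler or the rFaaS API: the `Node` class, the `place_functions` helper, and the first-fit policy are all hypothetical, chosen only to show how short-lived functions might fill cores that a batch allocation leaves idle.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a batch-managed cluster node."""
    name: str
    total_cores: int
    allocated_cores: int  # cores currently held by batch jobs

    @property
    def idle_cores(self) -> int:
        return self.total_cores - self.allocated_cores


def place_functions(nodes, requests):
    """First-fit placement of (function_name, cores) requests on idle cores.

    Returns a dict mapping each function name to a node name, or to None
    when no node has enough idle cores. First-fit is an assumption made
    for illustration, not the policy described in the paper.
    """
    free = {node.name: node.idle_cores for node in nodes}
    placement = {}
    for func, cores in requests:
        placement[func] = None
        for name, idle in free.items():
            if idle >= cores:
                placement[func] = name
                free[name] -= cores  # reserve the cores for this function
                break
    return placement


# Illustrative numbers: one partially allocated node and one unallocated node.
nodes = [Node("n1", 36, 32), Node("n2", 36, 0)]
requests = [("f1", 4), ("f2", 8), ("f3", 40)]
placement = place_functions(nodes, requests)
```

Here `f1` fits into the four cores left over on the partially allocated node, `f2` goes to the unallocated node, and `f3` stays unplaced because no single node has 40 idle cores. A real system would also need to account for memory, accelerators, and interference with the co-located batch job.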

Software Resource Disaggregation for HPC with Serverless Computing

rFaaS: Enabling High Performance Serverless with RDMA and Leases

Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk

A Case For Intra-rack Resource Disaggregation in HPC

Cppless: Productive and Performant Serverless Programming in C++

QoS-aware offloading policies for serverless functions in the Cloud-to-Edge continuum

A framework for offloading and migration of serverless functions in the Edge-Cloud Continuum

Orchestrating the Execution of Serverless Functions in Hybrid Clouds

Heterogeneity-aware Proactive Elastic Resource Allocation for Serverless Applications

Intelligent colocation of HPC workloads

In Serverless, OS Scheduler Choice Costs Money: A Hybrid Scheduling Approach for Cheaper FaaS

Function Delivery Network: Extending Serverless Computing for Heterogeneous Platforms

Serverless Computing Based on Dynamic-Addressable Session

GreenFaaS: Maximizing Energy Efficiency of HPC Workloads with FaaS

Spatio-Temporal Resource Meshes for Serverless Computing

SFSM: A Serverless Function Scheduling Method for FaaS Applications over Edge Computing

Data-driven scheduling in serverless computing to reduce response time

FaaSGraph: Enabling Scalable, Efficient, and Cost-Effective Graph Processing with Serverless Computing

LaSS: Running Latency Sensitive Serverless Computations at the Edge

In-Storage Domain-Specific Acceleration for Serverless Computing

Rethinking High Performance Computing Platforms: Challenges, Opportunities and Recommendations