Abstract: Azure Cloud offers a wide range of resources for running HPC workloads, requiring users to configure their deployment by selecting VM types, number of VMs, and processes per VM. Suboptimal decisions may lead to longer execution times or additional costs for the user. We are developing an open-source tool to assist users in making these decisions by considering application input parameters, as they influence resource consumption. The tool automates the time-consuming process of setting up the cloud environment, executing the benchmarking runs, handling output, and providing users with resource selection recommendations as high-level insights on run times and costs across different VM types and numbers of VMs. In this work, we present initial results and insights on reducing the number of cloud executions needed to provide such guidance, leveraging data analytics and optimization techniques with two well-known HPC applications: OpenFOAM and LAMMPS.

What problem does this paper attempt to address?

This paper addresses how to optimize execution time and cost when running high-performance computing (HPC) workloads on the Azure cloud platform. When configuring a deployment, users must select the virtual machine (VM) type, the number of VMs, and the number of processes per VM. Suboptimal selections may lead to longer execution times or additional costs.

### Core of the problem

1. **Complexity of resource selection**: Users need to select appropriate resources according to the application input parameters, which is especially difficult for non-IT experts.
2. **Balance between execution time and cost**: Users need to find a resource configuration that keeps execution time low while controlling cost.
3. **Insufficient data**: In many cases, users lack enough historical data to make effective resource selections.

### Solution

To help users with these problems, the authors are developing an open-source tool that simplifies HPC resource selection in the following ways:

- **Automated setup**: Automatically completes tedious steps such as cloud environment setup, benchmark execution, and output processing.
- **Data analysis and optimization**: Uses data analysis and optimization techniques to reduce the number of required cloud executions and to provide high-level insights into execution time and cost across different VM types and numbers of VMs.
- **Prediction model**: Predicts, from the existing data points, the execution time for different VM types and application input parameters, thereby reducing the number of unnecessary experiments.

### Application cases

The tool has been tested with two well-known HPC applications, OpenFOAM and LAMMPS. The initial results show that the tool can significantly reduce the number of scenarios that need to be executed while still providing accurate resource selection recommendations.
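The balance between execution time and cost described above can be illustrated with a small sketch. It is not taken from the tool itself: the `Run` record, the `pareto_front` helper, the pay-per-VM-hour cost model, and all VM names, times, and prices below are hypothetical values chosen for illustration. A Pareto front is used here as one plausible way to summarize the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmarked configuration (all fields are illustrative)."""
    vm_type: str
    n_vms: int
    exec_time_s: float
    price_per_vm_hour: float  # assumed pay-per-time pricing

    @property
    def cost(self) -> float:
        # Assumed cost model: hours used x number of VMs x hourly price.
        return self.exec_time_s / 3600.0 * self.n_vms * self.price_per_vm_hour


def pareto_front(runs):
    """Return the runs that no other run beats on both time and cost."""
    front = []
    for r in runs:
        dominated = any(
            o.exec_time_s <= r.exec_time_s
            and o.cost <= r.cost
            and (o.exec_time_s < r.exec_time_s or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            front.append(r)
    # Fastest configuration first.
    return sorted(front, key=lambda r: r.exec_time_s)


# Hypothetical measurements, not results from the paper.
runs = [
    Run("vm_type_a", 2, 3600.0, 1.0),
    Run("vm_type_a", 4, 2000.0, 1.0),
    Run("vm_type_b", 2, 3000.0, 2.0),
    Run("vm_type_b", 4, 1500.0, 2.0),
]

for r in pareto_front(runs):
    print(f"{r.vm_type} x{r.n_vms}: {r.exec_time_s:.0f} s, cost {r.cost:.2f}")
```

In this made-up data set, the two-VM `vm_type_b` run is dropped because another configuration is both faster and cheaper. The remaining runs are the ones between which a user would actually have to choose.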
### Formula representation

During the optimization process, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is used to optimize the scaling factor \( \alpha \) to minimize the deviation between the known data points and the predicted time. The objective function can be expressed as:

\[ \min_{\alpha} \sum_{i = 1}^{n} \left( t_i^{\text{known}} - \alpha \cdot t_i^{\text{predicted}} \right)^2 \]

where:

- \( t_i^{\text{known}} \) is the known execution time,
- \( t_i^{\text{predicted}} \) is the execution time predicted by linear interpolation,
- \( \alpha \) is the optimized scaling factor.

Through this method, the tool can provide users with efficient and economical resource configuration suggestions based on limited data.
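The scaling-factor fit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation. The function names, the idea of interpolating over a fully measured reference curve, and all numbers are hypothetical. The sketch also swaps BFGS for the closed-form solution: because the objective is quadratic in the single parameter \( \alpha \), setting its derivative to zero gives \( \alpha = \sum_i t_i^{\text{known}} t_i^{\text{predicted}} / \sum_i (t_i^{\text{predicted}})^2 \), which is the minimizer an iterative method such as BFGS would converge to.

```python
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; xs must be sorted with at least 2 points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the interpolation range")
    # Index of the right-hand endpoint of the segment containing x.
    j = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[j - 1], xs[j]
    y0, y1 = ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def fit_scaling_factor(t_known, t_predicted):
    """Least-squares alpha minimizing sum((t_known - alpha * t_predicted)^2)."""
    den = sum(p * p for p in t_predicted)
    if den == 0:
        raise ValueError("predicted times must not all be zero")
    return sum(k * p for k, p in zip(t_known, t_predicted)) / den


# Hypothetical reference curve: execution time (s) versus number of VMs.
ref_nodes = [1, 2, 4, 8]
ref_times = [800.0, 420.0, 230.0, 140.0]

# Hypothetical sparse measurements for another VM type.
known_nodes = [2, 8]
known_times = [336.0, 112.0]

# Predict at the measured points, then fit alpha against the measurements.
predicted = [interpolate(ref_nodes, ref_times, n) for n in known_nodes]
alpha = fit_scaling_factor(known_times, predicted)

# Estimate an unmeasured configuration by scaling the interpolated time.
estimate_4_vms = alpha * interpolate(ref_nodes, ref_times, 4)
print(f"alpha = {alpha:.3f}, estimated time on 4 VMs = {estimate_4_vms:.1f} s")
```

With more than one free parameter, or a non-quadratic objective, no closed form is available and a general-purpose optimizer such as BFGS becomes the natural choice.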

Simplifying HPC resource selection: A tool for optimizing execution time and cost on Azure

HPCAdvisor: A Tool for Assisting Users in Selecting HPC Resources in the Cloud

HPC AI500: A Benchmark Suite for HPC AI Systems

An SLA-based Advisor for Placement of HPC Jobs on Hybrid Clouds

Hedge Your Bets: Optimizing Long-term Cloud Costs by Mixing VM Purchasing Options

Machine Learning Algorithms for Active Monitoring of High Performance Computing as a Service (HPCaaS) Cloud Environments

Reproducible Workflow on a Public Cloud for Computational Fluid Dynamics

ACIC: Automatic Cloud I/O Configurator for HPC Applications

Online Resource Management in Thermal and Energy Constrained Heterogeneous High Performance Computing

Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service

HPX with Spack and Singularity Containers: Evaluating Overheads for HPX/Kokkos using an astrophysics application

Crossing the Architectural Barrier: Evaluating Representative Regions of Parallel HPC Applications

Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum

Automatic Cloud I/O Configurator for I/O Intensive Parallel Applications

Dynamic resource allocation for efficient parallel CFD simulations

Heterogeneous architectures for computational intensive applications: A cost-effectiveness analysis

Building Semi-Elastic Virtual Clusters for Cost-Effective HPC Cloud Resource Provisioning

FogROS2-Config: Optimizing Latency and Cost for Multi-Cloud Robot Applications

HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

Reproducible and Portable Workflows for Scientific Computing and HPC in the Cloud

Pricing Schemes for Energy-Efficient HPC Systems: Design and Exploration