Simplifying HPC resource selection: A tool for optimizing execution time and cost on Azure

Marco A. S. Netto, Wolfgang De Salvador, Davide Vanzo
2024-12-03
Abstract: Azure Cloud offers a wide range of resources for running HPC workloads, requiring users to configure their deployment by selecting VM types, number of VMs, and processes per VM. Suboptimal decisions may lead to longer execution times or additional costs for the user. We are developing an open-source tool to assist users in making these decisions by considering application input parameters, as they influence resource consumption. The tool automates the time-consuming process of setting up the cloud environment, executing the benchmarking runs, handling output, and providing users with resource selection recommendations as high-level insights on run times and costs across different VM types and number of VMs. In this work, we present initial results and insights on reducing the number of cloud executions needed to provide such guidance, leveraging data analytics and optimization techniques with two well-known HPC applications: OpenFOAM and LAMMPS.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The problem this paper addresses is how to optimize execution time and cost when running high-performance computing (HPC) workloads on the Azure cloud platform. When configuring a deployment, users must select the virtual machine (VM) type, the number of VMs, and the number of processes per VM. Suboptimal choices may lead to longer execution times or additional costs.

### Core of the problem

1. **Complexity of resource selection**: Users need to select appropriate resources according to the application input parameters, which is especially difficult for non-IT experts.
2. **Balance between execution time and cost**: Users need to find a resource configuration that keeps both execution time and cost low.
3. **Insufficient data**: In many cases, users lack enough historical data to make effective resource selections.

### Solution

To help users with these problems, the authors are developing an open-source tool that aims to simplify HPC resource selection in the following ways:

- **Automated setup**: Automatically complete tedious steps such as cloud environment setup, benchmark execution, and output processing.
- **Data analysis and optimization**: Use data analytics and optimization techniques to reduce the number of required cloud executions and provide high-level insights into the execution time and cost of different VM types and numbers of VMs.
- **Prediction model**: Based on the existing data points, predict the execution time under different VM types and application input parameters, thereby reducing the number of unnecessary experiments.

### Application cases

The tool has been tested with two well-known HPC applications: OpenFOAM and LAMMPS. The initial results indicate that the tool can reduce the number of scenarios that need to be executed while still providing resource selection guidance.
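To make the time/cost trade-off described above concrete, the following is a minimal Python sketch, not the tool's actual implementation. It assumes that the cost of a run is the number of VMs times an hourly VM price times the execution time, and it keeps only the scenarios that are not beaten on both time and cost by another scenario (a Pareto front). The `Scenario` class, the VM type names, the prices, and the timings are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One benchmark run (all values used below are hypothetical)."""
    vm_type: str
    num_vms: int
    exec_time_s: float
    price_per_vm_hour: float

    @property
    def cost(self) -> float:
        # Assumed cost model: VMs x hourly price x hours of execution.
        return self.num_vms * self.price_per_vm_hour * self.exec_time_s / 3600.0


def pareto_front(scenarios):
    """Return scenarios not dominated in both time and cost, fastest first."""
    front = []
    for s in scenarios:
        dominated = any(
            o.exec_time_s <= s.exec_time_s
            and o.cost <= s.cost
            and (o.exec_time_s < s.exec_time_s or o.cost < s.cost)
            for o in scenarios
        )
        if not dominated:
            front.append(s)
    return sorted(front, key=lambda s: s.exec_time_s)


# Hypothetical measurements for two placeholder VM types.
runs = [
    Scenario("vm_type_A", 2, 3600.0, 3.0),
    Scenario("vm_type_A", 4, 2000.0, 3.0),
    Scenario("vm_type_B", 2, 4000.0, 4.0),
    Scenario("vm_type_B", 4, 1800.0, 4.0),
]

front = pareto_front(runs)
for s in front:
    print(f"{s.vm_type} x{s.num_vms}: {s.exec_time_s:.0f} s, cost {s.cost:.2f}")
```

Presenting the non-dominated scenarios rather than a single "best" one leaves the final choice between faster and cheaper configurations to the user.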
### Formula representation

During the optimization process, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is used to optimize the scaling factor \( \alpha \) so as to minimize the deviation between the known data points and the predicted times. The objective function can be expressed as:

\[ \min_{\alpha} \sum_{i = 1}^{n} \left( t_i^{\text{known}} - \alpha \cdot t_i^{\text{predicted}} \right)^2 \]

where:

- \( t_i^{\text{known}} \) is the known execution time,
- \( t_i^{\text{predicted}} \) is the execution time predicted by linear interpolation,
- \( \alpha \) is the optimized scaling factor.

With this method, the tool can provide users with efficient and economical resource configuration suggestions based on limited data.
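The objective above can be sketched in a few lines of Python with NumPy and SciPy. This is an illustrative sketch under stated assumptions, not the tool's code: it assumes the predicted times come from linearly interpolating a reference curve measured on another VM type (`np.interp`), and that \( \alpha \) is fitted with `scipy.optimize.minimize` using `method="BFGS"`. The function name `fit_scaling_factor` and all numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def fit_scaling_factor(t_known, t_predicted):
    """Fit alpha minimizing sum((t_known - alpha * t_predicted)^2) with BFGS."""
    t_known = np.asarray(t_known, dtype=float)
    t_predicted = np.asarray(t_predicted, dtype=float)

    def objective(params):
        residuals = t_known - params[0] * t_predicted
        return float(np.sum(residuals ** 2))

    # Start from alpha = 1, i.e. "the interpolated curve is already right".
    result = minimize(objective, x0=np.array([1.0]), method="BFGS")
    return float(result.x[0])


# Hypothetical reference curve: execution time (s) vs. number of VMs.
ref_num_vms = np.array([1.0, 4.0, 16.0])
ref_times = np.array([1000.0, 280.0, 90.0])

# Hypothetical known runs on the target VM type at 2 and 8 VMs.
known_num_vms = np.array([2.0, 8.0])
t_known = np.array([610.0, 170.0])

# Linear interpolation of the reference curve at the known VM counts.
t_predicted = np.interp(known_num_vms, ref_num_vms, ref_times)

alpha = fit_scaling_factor(t_known, t_predicted)
print(f"fitted scaling factor: {alpha:.4f}")

# Scaled prediction for a VM count not yet executed on the target VM type.
estimate_12_vms = alpha * np.interp(12.0, ref_num_vms, ref_times)
print(f"estimated time at 12 VMs: {estimate_12_vms:.1f} s")
```

Because this objective is quadratic in a single variable, the minimizer also has the closed form \( \alpha = \sum_i t_i^{\text{known}} t_i^{\text{predicted}} / \sum_i \left( t_i^{\text{predicted}} \right)^2 \), which is a convenient check on the numerical result.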