AI-coupled HPC Workflow Applications, Middleware and Performance

Wes Brewer, Ana Gainaru, Frédéric Suter, Feiyi Wang, Murali Emani, Shantenu Jha
2024-06-20
Abstract: AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys the diverse and rapidly evolving field of AI-driven HPC and provides a common conceptual basis for understanding AI-driven HPC workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six execution motifs most commonly found in scientific applications. The proposed set of execution motifs is by definition incomplete and evolving. However, they allow us to analyze the primary performance challenges underpinning AI-driven HPC workflows. We close with a listing of open challenges, research issues, and suggested areas of investigation, including the need for specific benchmarks that will help evaluate and improve the execution of AI-driven HPC workflows.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses how to couple artificial intelligence (AI) with high-performance computing (HPC) workflows to improve the efficiency and performance of scientific computing. Specifically, it aims to:

1. **Improve the effective performance of HPC workflows**: Coupling AI systems to HPC simulations in real time lets AI guide or influence HPC tasks and vice versa, increasing the amount of science that can be done with given computing resources.
2. **Overcome the limitations of traditional forward simulations**: AI-coupled HPC workflows can address the bottlenecks that traditional simulations face in physical and time scales or at higher resolutions.
3. **Provide sustainable and scalable performance gains**: By integrating AI into the computing workflow, significant performance improvements can be obtained without relying on specific processor architectures.
4. **Propose execution motifs**: By analyzing the different modes of coupling AI into HPC workflows, the paper identifies six of the most common execution motifs, which help in understanding the main performance challenges of AI-driven HPC workflows.
5. **Discuss open research questions and challenges**: These include the need for specific benchmarks to evaluate and improve the execution of AI-driven HPC workflows, and the problem of coupling large-scale AI model training with HPC simulations.

### Six Execution Motifs

1. **AI-based Steering of Simulation Ensembles**
   - **Interaction mode**: Data flows from HPC to AI, and control flows from AI to HPC.
   - **Coupling mode**: Near-real-time analysis; simulations are dynamically generated or terminated.
   - **Scope**: Use AI inference to improve HPC components online.
2. **Multistage Pipeline**
   - **Interaction mode**: Data flows from one HPC stage to multiple AI analysis components, and AI controls the execution of the next HPC stage.
   - **Coupling mode**: Near-real-time interaction, but processes are usually not dynamically generated or terminated.
   - **Scope**: Mainly used to branch the execution of HPC stages to optimize results.
3. **Inverse Design**
   - **Interaction mode**: Multiple HPC simulations send data to AI, and AI exerts control by iteratively optimizing design parameters.
   - **Coupling mode**: Static, with concurrent or asynchronous execution, usually within the same system.
   - **Scope**: Improve both AI and HPC components simultaneously.
4. **Digital Replica**
   - **Interaction mode**: Data and control flow bidirectionally, combining experiments and HPC with AI.
   - **Coupling mode**: Real-time requirements, with monitoring and visualization; scientists may be involved in adjusting future runs.
   - **Scope**: AI and HPC improve each other.
5. **Distributed Models and Dynamic Data**
   - **Interaction mode**: Data and control flow bidirectionally, involving multiple HPC and AI components.
   - **Coupling mode**: Geographically distributed, adapting to wide-area network performance.
   - **Scope**: AI is used to compress, reconstruct, or remotely visualize dynamic data.
6. **Adaptive Training**
   - **Interaction mode**: Data flows from HPC to multiple AI components, and control flows from analysis to HPC and AI training.
   - **Coupling mode**: Near-real-time requirements, dynamic combination.
   - **Scope**: HPC is used to improve AI models.

### Summary

By identifying and describing these execution motifs, the paper aims to provide a common conceptual basis for AI-coupled HPC workflows, help readers understand their main performance challenges, and point the way for future research and development.
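
To make the first motif concrete (AI-based steering of an ensemble of simulations), the following is a minimal Python sketch of its control loop, not the paper's implementation. Everything in it is a hypothetical stand-in chosen for illustration: `Simulation` is a toy random-walk "HPC task", `score` plays the role of AI inference, and `steer_ensemble` with its parameters (`n_members`, `keep`, `target`) is an assumed interface. It shows only the pattern described above: data flows from the simulations to the scoring step, and control flows back by terminating poorly ranked members and spawning new ones.

```python
import random
from dataclasses import dataclass


@dataclass
class Simulation:
    """Toy stand-in for one HPC ensemble member (hypothetical)."""
    sim_id: int
    state: float

    def advance(self, rng: random.Random) -> None:
        # One "time step": a noisy update of a scalar state.
        self.state += rng.gauss(0.0, 1.0)


def score(sim: Simulation, target: float) -> float:
    """Toy stand-in for AI inference: distance to a target (lower is better)."""
    return abs(sim.state - target)


def steer_ensemble(n_members=8, n_rounds=20, keep=4, target=5.0, seed=0):
    """Run a steered ensemble; return the final members and the next free id."""
    rng = random.Random(seed)
    ensemble = [Simulation(i, 0.0) for i in range(n_members)]
    next_id = n_members
    for _ in range(n_rounds):
        # Data flow, HPC -> AI: every member advances and exposes its state.
        for sim in ensemble:
            sim.advance(rng)
        # Control flow, AI -> HPC: rank members and terminate the worst ones.
        ensemble.sort(key=lambda s: score(s, target))
        survivors = ensemble[:keep]
        # Spawn replacements that restart from the surviving states.
        spawned = []
        for i in range(n_members - keep):
            parent = survivors[i % keep]
            spawned.append(Simulation(next_id, parent.state))
            next_id += 1
        ensemble = survivors + spawned
    return ensemble, next_id


if __name__ == "__main__":
    members, total_launched = steer_ensemble()
    best = min(members, key=lambda s: score(s, 5.0))
    print(f"launched {total_launched} simulations; best state = {best.state:.2f}")
```

In a real workflow the scoring step would be a trained model running near-real-time analysis on simulation output, and spawning or terminating members would go through a workflow or resource manager rather than a Python list.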