ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies
Matteo Turilli, Mihael Hategan-Marandiuc, Mikhail Titov, Ketan Maheshwari, Aymen Alsaadi, Andre Merzky, Ramon Arambula, Mikhail Zakharchanka, Matt Cowan, Justin M. Wozniak, Andreas Wilke, Ozgur Ozan Kilic, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney
2024-07-24
Abstract: Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. This requires a software stack that lets users code their workflows and automates resource management and workflow execution. Currently, there are many workflow technologies with varying levels of robustness and capabilities, and users face difficult choices about which software can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of heterogeneous scientific workflows on the newly available exascale HPC platforms.
Software Engineering; Distributed, Parallel, and Cluster Computing