Scalable ATLAS pMSSM computational workflows using containerised REANA reusable analysis platform

Marco Donadoni, Matthew Feickert, Lukas Heinrich, Yang Liu, Audrius Mečionis, Vladyslav Moisieienkov, Tibor Šimko, Giordon Stark, Marco Vidal García
2024-03-06
Abstract: In this paper we describe the development of a streamlined framework for large-scale ATLAS pMSSM reinterpretations of LHC Run-2 analyses using containerised computational workflows. The project is looking to assess the global coverage of BSM physics and requires running O(5k) computational workflows representing pMSSM model points. Following ATLAS Analysis Preservation policies, many analyses have been preserved as containerised Yadage workflows, and after validation were added to a curated selection for the pMSSM study. To run the workflows at scale, we utilised the REANA reusable analysis platform. We describe how the REANA platform was enhanced to ensure the best concurrent throughput by internal service scheduling changes. We discuss the scalability of the approach on Kubernetes clusters from 500 to 5000 cores. Finally, we demonstrate a possibility of using additional ad-hoc public cloud infrastructure resources by running the same workflows on the Google Cloud Platform.
Distributed, Parallel, and Cluster Computing; High Energy Physics - Experiment
What problem does this paper attempt to address?
This paper addresses how to run a large number of ATLAS pMSSM (phenomenological Minimal Supersymmetric Standard Model) workflows efficiently in a large-scale parallel computing environment. The project aims to assess the global coverage of physics beyond the Standard Model (BSM), which requires running thousands of computational workflows, each representing a pMSSM model point. The workflows reinterpret LHC (Large Hadron Collider) Run-2 analyses and are executed as containerised workflows on the REANA reusable analysis platform. To this end, the researchers developed a streamlined framework for ATLAS pMSSM reinterpretation, based on the RECAST concept and informed by experience gained during LHC Run-1. Following the ATLAS Analysis Preservation policies, many ATLAS analyses have been preserved as containerised Yadage workflows and, after validation, were added to a curated selection of analyses for the pMSSM study. The study also explores the use of additional ad-hoc public cloud infrastructure by running the same workflows on the Google Cloud Platform. The paper describes how the REANA platform was enhanced to achieve the best concurrent throughput on Kubernetes clusters ranging from 500 to 5,000 cores, and how challenges in workflow scheduling, execution, and termination were handled. With these improvements, the team can run the thousands of workflows that the pMSSM study requires.
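The idea of running thousands of model-point workflows with a cap on concurrency can be illustrated with a minimal Python sketch. This is not the REANA scheduler and does not use the REANA client API; `run_model_point`, `run_all`, and the `max_concurrent` value are hypothetical names and numbers chosen purely for illustration, and the workflow body is a stand-in that returns immediately.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_model_point(point_id: int) -> dict:
    """Hypothetical stand-in for submitting one pMSSM workflow and waiting for it.

    A real implementation would submit a containerised workflow to a
    platform and poll its status; here we only return a fake result.
    """
    return {"point": point_id, "status": "finished"}


def run_all(point_ids, max_concurrent: int) -> dict:
    """Run one workflow per model point, at most `max_concurrent` at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # Queue every model point; the pool enforces the concurrency cap.
        futures = {pool.submit(run_model_point, p): p for p in point_ids}
        for future in as_completed(futures):
            point = futures[future]
            try:
                results[point] = future.result()["status"]
            except Exception as exc:
                # Record the failure instead of aborting the whole campaign.
                results[point] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # 5000 mirrors the O(5k) scale quoted in the abstract; 50 is an assumption.
    statuses = run_all(range(5000), max_concurrent=50)
    print(len(statuses), "workflows processed")
```

The cap plays the role that available cluster cores play in practice: all model points are queued up front, and only a bounded number are active at any time, so one failed workflow does not stop the rest of the campaign.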