Scalable ATLAS pMSSM computational workflows using containerised REANA reusable analysis platform

Marco Donadoni, Matthew Feickert, Lukas Heinrich, Yang Liu, Audrius Mečionis, Vladyslav Moisieienkov, Tibor Šimko, Giordon Stark, Marco Vidal García

2024-03-06

Abstract: In this paper we describe the development of a streamlined framework for large-scale ATLAS pMSSM reinterpretations of LHC Run-2 analyses using containerised computational workflows. The project is looking to assess the global coverage of BSM physics and requires running O(5k) computational workflows representing pMSSM model points. Following ATLAS Analysis Preservation policies, many analyses have been preserved as containerised Yadage workflows, and after validation were added to a curated selection for the pMSSM study. To run the workflows at scale, we utilised the REANA reusable analysis platform. We describe how the REANA platform was enhanced to ensure the best concurrent throughput by internal service scheduling changes. We discuss the scalability of the approach on Kubernetes clusters from 500 to 5000 cores. Finally, we demonstrate a possibility of using additional ad-hoc public cloud infrastructure resources by running the same workflows on the Google Cloud Platform.

Distributed, Parallel, and Cluster Computing; High Energy Physics - Experiment

What problem does this paper attempt to address?

The paper addresses how to run a large number of ATLAS pMSSM (phenomenological Minimal Supersymmetric Standard Model) workflows efficiently in a large-scale parallel computing environment. The project aims to assess the global coverage of physics beyond the Standard Model (BSM), which requires running O(5k) computational workflows, each representing a pMSSM model point. The workflows reinterpret LHC (Large Hadron Collider) Run-2 analyses and are executed as containerised computational workflows on the REANA reusable analysis platform. To this end, the researchers developed a streamlined framework for ATLAS pMSSM reinterpretation, based on the RECAST concept and informed by experience gained during LHC Run-1. Following the ATLAS Analysis Preservation policies, many ATLAS analyses have been preserved as containerised Yadage workflows and, after validation, added to a curated selection for the pMSSM study. The paper describes how the REANA platform was enhanced, through internal service scheduling changes, to achieve the best concurrent throughput, and it discusses the scalability of the approach on Kubernetes clusters from 500 to 5,000 cores, including the challenges encountered in workflow scheduling, execution, and termination. It also demonstrates the possibility of using additional ad-hoc public cloud resources by running the same workflows on the Google Cloud Platform. With these improvements, the research team can process thousands of pMSSM workflows effectively.
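The idea of running thousands of independent workflows under a cap on concurrency can be illustrated with a minimal Python sketch. This is not REANA's scheduler and does not use any REANA API: `run_model_point`, `run_all`, and the `max_concurrent` parameter are hypothetical names chosen for illustration, and the placeholder function simply returns a status instead of launching a real containerised workflow.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_model_point(point_id):
    """Hypothetical stand-in for running one pMSSM model-point workflow.

    A real implementation would submit a containerised workflow and wait
    for it to finish; here we only return a status so the sketch runs.
    """
    return point_id, "finished"


def run_all(point_ids, max_concurrent):
    """Run every model point, with at most `max_concurrent` in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # Queue all model points; the pool limits how many run at once.
        futures = [pool.submit(run_model_point, p) for p in point_ids]
        # Collect statuses in completion order, not submission order.
        for future in as_completed(futures):
            point_id, status = future.result()
            results[point_id] = status
    return results


if __name__ == "__main__":
    # Assumed numbers: 5000 model points, 50 running concurrently.
    statuses = run_all(range(5000), max_concurrent=50)
    print(len(statuses), "workflows completed")
```

In this sketch the concurrency cap plays the role that available cluster cores play in the paper: raising `max_concurrent` only helps throughput as long as the underlying resources can actually host that many workflows at once.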
