Machine Learning for Columnar High Energy Physics Analysis

Elliott Kauffman, Alexander Held, Oksana Shadura
2024-01-04
Abstract: Machine learning (ML) has become an integral component of high energy physics data analyses and is likely to continue to grow in prevalence. Physicists are incorporating ML into many aspects of analysis, from using boosted decision trees to classify particle jets to using unsupervised learning to search for physics beyond the Standard Model. Since ML methods have become so widespread in analysis and these analyses need to be scaled up for HL-LHC data, neatly integrating ML training and inference into scalable analysis workflows will improve the user experience of analysis in the HL-LHC era. We present the integration of ML training and inference into the IRIS-HEP Analysis Grand Challenge (AGC) pipeline to provide an example of what this integration can look like in a realistic analysis environment. We also use Open Data to ensure the project's reach to the broader community. Different approaches for performing ML inference at analysis facilities are investigated and compared, including performing inference through external servers. Since ML techniques are applied to many different types of tasks in physics analyses, we showcase options for ML integration that can be applied to various inference needs.
High Energy Physics - Experiment
What problem does this paper attempt to address?
This paper addresses how to integrate machine learning (ML) effectively into columnar analysis workflows in high-energy physics, in order to meet the data processing challenges of the High-Luminosity Large Hadron Collider (HL-LHC) era. Traditionally, high-energy physicists have relied on compiled languages such as C++ for fast processing. With data volumes and event sizes increasing, and with ML now used across many analysis tasks, ML training and inference need to fit cleanly into analysis workflows that scale. The paper integrates ML training and inference into the IRIS-HEP Analysis Grand Challenge (AGC) pipeline as an example of how this can be done in a realistic analysis environment, and it uses Open Data so that the example is accessible to the broader community. It also investigates and compares different approaches to performing ML inference at analysis facilities, including inference through external servers, so that the options shown can serve different kinds of inference needs.
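To make the distinction between in-process inference and inference through an external server more concrete, the following is a minimal sketch and not the paper's implementation. Everything in it is an assumption made for illustration: the column names `jet_pt` and `jet_eta`, the toy logistic model with made-up weights, and the helper names `score_rows`, `infer_local`, and `infer_batched`. Plain Python lists stand in for the array columns a real columnar analysis would use, and the "server" is simulated by a callable that receives fixed-size batches.

```python
import math
from typing import Callable, Dict, List, Sequence

# Columnar event data: one list per feature, one entry per event.
Columns = Dict[str, List[float]]

# Hypothetical feature order and toy model parameters (not from the paper).
FEATURES = ["jet_pt", "jet_eta"]
WEIGHTS = [0.02, -0.5]
BIAS = -1.0


def score_rows(rows: Sequence[Sequence[float]]) -> List[float]:
    """Score a batch of feature rows with a toy logistic model.

    This stands in for a trained classifier, wherever it happens to run.
    """
    scores = []
    for row in rows:
        z = BIAS + sum(w * x for w, x in zip(WEIGHTS, row))
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def to_rows(columns: Columns) -> List[List[float]]:
    """Stack the per-feature columns into one feature row per event."""
    return [list(row) for row in zip(*(columns[name] for name in FEATURES))]


def infer_local(columns: Columns) -> List[float]:
    """In-process inference: the model is evaluated inside the analysis task."""
    return score_rows(to_rows(columns))


def infer_batched(
    columns: Columns,
    batch_size: int,
    send: Callable[[Sequence[Sequence[float]]], List[float]] = score_rows,
) -> List[float]:
    """Server-style inference: rows are sent in fixed-size batches via `send`.

    `send` is a placeholder for a network request to an inference service;
    here it defaults to the local toy model so the sketch stays runnable.
    """
    rows = to_rows(columns)
    scores: List[float] = []
    for start in range(0, len(rows), batch_size):
        scores.extend(send(rows[start:start + batch_size]))
    return scores


# Three made-up events, stored column-wise.
events = {"jet_pt": [30.0, 55.0, 120.0], "jet_eta": [0.1, -1.2, 2.0]}
local_scores = infer_local(events)
remote_scores = infer_batched(events, batch_size=2)
```

Because both paths return one score per event in the original event order, the rest of the analysis does not need to know where the model ran. In this sketch that shared interface is what would let the two approaches be swapped and compared.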