Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis

Johannes P. Blaschke, Aaron S. Brewster, Daniel W. Paley, Derek Mendez, Asmit Bhowmick, Nicholas K. Sauter, Wilko Kröger, Murali Shankar, Bjoern Enders, Deborah Bard
DOI: https://doi.org/10.1002/cpe.8019
2024-01-01
Abstract: X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC's Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where they are analyzed using the software package CCTBX. We achieved real-time data analysis with a turnaround time from data acquisition to full molecular reconstruction in as little as 10 min, sufficient time for the experiment's operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis within 10 min is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends.
Distributed, Parallel, and Cluster Computing; Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
The paper primarily addresses real-time data analysis in X-ray Free Electron Laser (XFEL) experiments. It identifies two major challenges:

1. **The need for rapid decision-making**: Because XFEL facilities are expensive to operate, experimenters need a fast turnaround from data acquisition to data analysis so that they can adjust experimental protocols in time.
2. **The rapid growth in data collection rates**: Data collection rates are growing exponentially, which calls for new scalable algorithms.

To address these challenges, the researchers analyzed data from two experiments (LV95 and P175) conducted in September 2020, using a "Superfacility" workflow that links the Linac Coherent Light Source (LCLS) at the SLAC National Accelerator Laboratory with the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL). This workflow provided the following:

- **Automatic data transfer**: Raw data generated by the experiments are automatically transferred from LCLS to the Cori XC40 system at NERSC for analysis.
- **Real-time data analysis**: Using the CCTBX software package, the full path from data acquisition to molecular structure reconstruction was completed in as little as 10 minutes, fast enough for the experiment's operators to make informed decisions.
- **Matching data analysis rates**: Through automated job submission and data management, the data analysis rate matched the data acquisition rate, a significant milestone for XFEL experiments.

Additionally, the paper discusses high-performance computing challenges that had to be overcome, such as imbalanced data processing times and bursty computational demands, and proposes an "urgent and real-time computing" approach to optimize resource allocation.
Through these methods, the paper demonstrates that large-scale XFEL data can be analyzed fast enough to guide a running experiment, and it offers a reference point for future experiments.
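
The rate-matching requirement summarized above can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' method: the function names, the acquisition rate, and the per-image processing time are all hypothetical values chosen for illustration and do not come from the paper. It assumes images are processed independently, so that aggregate throughput scales linearly with the number of workers.

```python
import math


def workers_needed(acquisition_rate_hz: float,
                   seconds_per_image: float,
                   safety_factor: float = 1.0) -> int:
    """Smallest worker count whose combined throughput keeps up with acquisition.

    Assumes each worker processes one image at a time and that images are
    independent (an illustrative simplification, not a claim about CCTBX).
    """
    if acquisition_rate_hz <= 0 or seconds_per_image <= 0 or safety_factor < 1.0:
        raise ValueError("rates and times must be positive; safety_factor >= 1")
    return math.ceil(acquisition_rate_hz * seconds_per_image * safety_factor)


def backlog_after(duration_s: float,
                  acquisition_rate_hz: float,
                  seconds_per_image: float,
                  workers: int) -> float:
    """Number of unprocessed images accumulated after duration_s seconds."""
    throughput_hz = workers / seconds_per_image  # images analyzed per second
    return max(0.0, (acquisition_rate_hz - throughput_hz) * duration_s)


# Hypothetical numbers: 120 images/s acquired, 2 s of compute per image.
needed = workers_needed(120.0, 2.0)
shortfall = backlog_after(600.0, 120.0, 2.0, workers=200)
print(f"workers needed: {needed}, backlog with 200 workers after 10 min: {shortfall:.0f}")
```

With too few workers the backlog grows linearly with run time, which is why analysis that lags acquisition cannot inform decisions during the experiment. The `safety_factor` parameter is one simple way to leave headroom for the imbalanced processing times and bursty demand mentioned above.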