NSF RESUME HPC Workshop: High-Performance Computing and Large-Scale Data Management in Service of Epidemiological Modeling

Abby Stevens, Jonathan Ozik, Kyle Chard, Jaline Gerardin, Justin M. Wozniak
2023-08-09
Abstract: The NSF-funded Robust Epidemic Surveillance and Modeling (RESUME) project successfully convened a workshop entitled "High-performance computing and large-scale data management in service of epidemiological modeling" at the University of Chicago on May 1-2, 2023. This was part of a series of workshops designed to foster sustainable and interdisciplinary co-design for predictive intelligence and pandemic prevention. The event brought together 31 experts in epidemiological modeling, high-performance computing (HPC), HPC workflows, and large-scale data management to develop a shared vision for capabilities needed for computational epidemiology to better support pandemic prevention. Through the workshop, participants identified key areas in which HPC capabilities could be used to improve epidemiological modeling, particularly in supporting public health decision-making, with an emphasis on HPC workflows, data integration, and HPC access. The workshop explored nascent HPC workflow and large-scale data management approaches currently in use for epidemiological modeling and sought to draw from approaches used in other domains to determine which practices could be best adapted for use in epidemiological modeling. This report documents the key findings and takeaways from the workshop.
Computers and Society
What problem does this paper attempt to address?
The paper aims to improve epidemiological modeling through high-performance computing (HPC) and large-scale data management, especially in support of public health decision-making. Specifically, it examines the following key areas:

1. **Evaluating current computational resources for epidemiological modeling**:
   - **Advantages**: Many researchers have access to large-scale computational resources and can use local and cloud computing resources efficiently. These systems usually have large storage capacities and can handle the volume of data required for large-scale epidemiological modeling. Some research teams have also developed specialized software tools to interact with these systems.
   - **Disadvantages**: Data locality problems reduce the efficiency of data analysis in some settings, while other systems are limited by job constraints or by the use of multiple versions of job-scheduling tools. Switching from one HPC system to another is usually difficult, mainly because of the need to understand performance differences, manage technical debt, and work around missing documentation and training. Many users report difficulty using resources efficiently and tuning the system for optimal performance. In addition, tasks that could benefit from automation, such as hyperparameter checking and model calibration, usually require manual handling, which increases the time and complexity of the modeling process.
   - **Capabilities in an ideal world**: A language-independent and highly automated system would allow for easy collaboration and let users focus on science rather than computing. This system would include a comprehensive model library, accompanied by code, documentation, and parameter repositories. It would promote model reuse, providing a pre-rated model catalog to meet different needs. Key functions of the system would include modular problem packaging, a series of rapidly deployable calibration models, and clear parameter sources. The system would also promote model comparison, integrate HPC-friendly calibration capabilities, and provide tools for visualizing shared results. Automation would handle error checking, task termination, and retries, and would provide a feedback loop for performance evaluation. A user-friendly design would allow easy extraction of detailed modeling outputs and the ability to switch between different model formulations. Other attributes would include code containerization, an interactive cloud system, continuous integration for HPC resources, and a software abstraction layer for flexibility and extensive testing.

2. **Evaluating current data practices**:
   - **Advantages**: Many current practices demonstrate effective use and management of diverse data sets. Public data is often used, and there are some well-structured practices for cleaning and preparing this data for further processing. For sensitive data, secure enclaves have been developed to store and analyze data while protecting privacy and confidentiality. Other data practices include establishing cooperation agreements to access proprietary data. Automated processes have been established to handle routine data tasks, such as downloading. Access to HPC resources and dedicated queues is considered an advantage. There are also some promising efforts to develop better data tools, such as real-time data summaries and quality-checking systems.
   - **Disadvantages**: Data locality is a major problem because data is often too large or too sensitive to move, which can impede access and computing. Some parts of the data pipeline lack automation, increasing manual labor and the potential for errors. Several participants reported a need for more transparent and user-friendly tools and for simple methods to explore data for model development. Problems related to tracking the sources of data and model parameters, data cleaning, and interpretation are common. Data sharing is also a problem, with privacy issues and academic competition often being limiting factors.
   - **Capabilities in an ideal world**: Data storage and access would be secure, seamless, and require minimal user attention while adapting to specific requirements, such as those for protected health information. Intentional databases would simplify data management, and the cleaning, validation, and quality-control processes would be automated to improve efficiency. Model outputs would be standardized, tracked, and made public in a FAIR (Findable, Accessible, Interoperable, Reusable) manner, allowing different stakeholders to reuse data. Standardized data APIs would facilitate the coordination of data from various sources, and automated data retrieval would speed up access. Clear version control and provenance would ensure data reliability. The system would make data available where needed, whether stored in a database or as a flat file. APIs would manage access to protected data according to user permissions. Interaction with HPC resources would become easier, with low-barrier data preparation, alternatives to the file system, simplified authentication, and the ability to visualize HPC outputs.
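
The provenance and version-control capabilities described above could be approached in many ways, and the workshop report does not prescribe one. As a minimal sketch only, the following Python example shows one possible way to tie a model run to the exact data and parameters it used by storing a content hash alongside the parameter values. The names `ProvenanceRecord`, `make_record`, and `matches`, the file name `example_cases.csv`, and the parameter `beta` are all hypothetical and chosen purely for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Hypothetical record linking a model run to its exact inputs."""

    data_source: str   # where the input data came from (illustrative label)
    data_sha256: str   # content hash of the input data
    parameters: dict   # model parameters used for the run
    created_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def sha256_of_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(payload).hexdigest()


def make_record(data_source: str, payload: bytes, parameters: dict) -> ProvenanceRecord:
    """Build a provenance record for one data snapshot and parameter set."""
    return ProvenanceRecord(
        data_source=data_source,
        data_sha256=sha256_of_bytes(payload),
        parameters=dict(parameters),  # copy so later edits do not alter the record
    )


def matches(record: ProvenanceRecord, payload: bytes) -> bool:
    """Check whether the given data is byte-identical to what was recorded."""
    return record.data_sha256 == sha256_of_bytes(payload)


if __name__ == "__main__":
    # Made-up data and parameter values, for illustration only.
    snapshot = b"date,cases\n2023-05-01,12\n"
    record = make_record("example_cases.csv", snapshot, {"beta": 0.3})
    # The record serializes to JSON, so it can be stored next to model outputs.
    print(json.dumps(asdict(record), indent=2))
    print("unchanged:", matches(record, snapshot))
```

Storing a record like this with each model output would let a later reader confirm that a result was produced from a specific data snapshot, which is one small piece of the FAIR-style tracking the participants describe.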