Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing

Logan Ward, Ganesh Sivaraman, J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Naveen Dandu, Paul C. Redfern, Rajeev S. Assary, Kyle Chard, Larry A. Curtiss, Rajeev Thakur, Ian Foster
DOI: https://doi.org/10.1109/MLHPC54614.2021.00007
2021-10-06
Abstract: Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
Distributed, Parallel, and Cluster Computing; Materials Science; Machine Learning
What problem does this paper attempt to address?
The paper addresses how to design and execute large-scale simulation ensembles efficiently in high-performance computing (HPC) environments, where resources are limited and the search space is vast. It proposes Colmena, an open-source Python framework that uses machine learning (ML) to steer ensemble simulations by choosing which simulations to run and how to allocate resources to them. Colmena's main goal is to simplify control over complex task sequences: users supply the implementations of individual tasks and the logic that decides which tasks to run when, while Colmena handles task dispatch, result collation, ML model invocation, and ML model retraining, using the Parsl library to execute tasks on HPC systems. The paper emphasizes an application to electrolyte design, where Colmena scales to 65,536 CPUs and increases the discovery rate of high-performance molecules by a factor of 100 compared to unguided searches.

The paper first outlines the need to accelerate ensemble simulations in scientific applications and the role of experiment design methods, especially ML-based ones, in selecting the best simulations to perform. It then details Colmena's design and implementation, including its architecture, communication mechanisms, and performance evaluation. Finally, it demonstrates Colmena on the electrolyte design example, analyzing its scalability on a Cray XC40 system and its acceleration of molecular discovery.
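The steering pattern summarized above (rank candidates with a proxy model, dispatch the most promising simulations, collect results, retrain the proxy) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not Colmena's API: `simulate`, `NearestNeighborProxy`, and `steer` are hypothetical names, the objective function is invented, and a thread pool stands in for Parsl-managed HPC workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(x: float) -> float:
    """Stand-in for an expensive simulation (hypothetical objective, best at x = 0.7)."""
    return -(x - 0.7) ** 2


class NearestNeighborProxy:
    """Toy proxy model: predicts the score of the closest already-evaluated input."""

    def __init__(self):
        self.data = []  # list of (input, score) pairs seen so far

    def retrain(self, results):
        # "Training" here is just memorizing the evaluated points.
        self.data = list(results)

    def predict(self, x: float) -> float:
        if not self.data:
            return 0.0  # untrained: every candidate looks equally good
        return min(self.data, key=lambda r: abs(r[0] - x))[1]


def steer(candidates, batch_size=4, rounds=3, workers=4):
    """Run a proxy-guided campaign and return the (input, score) pairs evaluated."""
    proxy = NearestNeighborProxy()
    results = []
    remaining = list(candidates)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            if not remaining:
                break
            # Steering logic: rank unevaluated candidates by predicted score.
            remaining.sort(key=proxy.predict, reverse=True)
            batch, remaining = remaining[:batch_size], remaining[batch_size:]
            # Task dispatch: run the selected simulations concurrently.
            futures = {pool.submit(simulate, x): x for x in batch}
            # Results collation: gather outputs as tasks finish.
            for future in as_completed(futures):
                results.append((futures[future], future.result()))
            # Model (re)training on everything observed so far.
            proxy.retrain(results)
    return results


if __name__ == "__main__":
    evaluated = steer([i / 20 for i in range(21)])
    print(max(evaluated, key=lambda r: r[1]))
```

The nearest-neighbor proxy is deliberately crude and is only meant to make the loop structure visible; a real campaign would substitute a trained ML model and would typically overlap simulation and training tasks rather than alternate them in fixed rounds.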