Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing

Logan Ward, Ganesh Sivaraman, J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Naveen Dandu, Paul C. Redfern, Rajeev S. Assary, Kyle Chard, Larry A. Curtiss, Rajeev Thakur, Ian Foster

DOI: https://doi.org/10.1109/MLHPC54614.2021.00007

2021-10-06

Abstract: Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.

Distributed, Parallel, and Cluster Computing; Materials Science; Machine Learning

What problem does this paper attempt to address?

The paper addresses how to design and execute large simulation ensembles efficiently on high-performance computing (HPC) systems when resources are limited and the search space is vast. It proposes Colmena, an open-source Python framework that uses machine learning (ML) to steer such ensembles, deciding which simulations to run and how to allocate resources among them.

Colmena aims to simplify control over complex, dynamic mixes of simulation and learning tasks. Users supply only the implementations of the individual tasks and the logic that chooses which tasks to execute and when. Colmena handles task dispatch, result collation, ML model invocation, and ML model retraining, and uses the Parsl library to execute tasks on HPC systems.

The paper first motivates the need to accelerate ensemble simulations in scientific applications and highlights the role of experiment design methods, particularly ML-based ones, in selecting the best simulations to perform. It then describes the design and implementation of Colmena, including its architecture, communication mechanisms, and performance evaluation. Finally, it applies Colmena to electrolyte design, where the framework scales to 65,536 CPUs and accelerates the discovery rate of high-performance molecules by a factor of 100 over unguided searches. The scalability analysis is carried out on a Cray XC40 system.
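The steering pattern summarized above (dispatch simulations, collect results, update a proxy model, use the model to pick the next simulation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Colmena's API: it uses only Python's standard `concurrent.futures` in place of Parsl, and `simulate`, `predict`, and `steer` are hypothetical names with a toy objective and a toy nearest-neighbor proxy model.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def simulate(x):
    """Stand-in for an expensive simulation (hypothetical objective, best at x = 0.7)."""
    return -(x - 0.7) ** 2


def predict(history, x):
    """Toy proxy model: return the score of the nearest already-simulated point."""
    nearest = min(history, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def steer(candidates, budget, workers=4, seed=0):
    """Run at most `budget` simulations, letting the proxy model pick each new one."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    history = []   # (input, score) pairs collected so far
    running = {}   # future -> input currently being simulated
    submitted = 0
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Seed the campaign with random picks so the proxy has data to learn from.
        while pool and submitted < min(workers, budget):
            x = pool.pop()
            running[executor.submit(simulate, x)] = x
            submitted += 1
        while running:
            # Collect whichever simulations have finished.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                x = running.pop(future)
                history.append((x, future.result()))
            # "Retraining" is implicit here: the toy proxy reads `history` directly.
            # Refill idle workers with the candidates the proxy ranks highest.
            while pool and submitted < budget and len(running) < workers:
                best = max(pool, key=lambda c: predict(history, c))
                pool.remove(best)
                running[executor.submit(simulate, best)] = best
                submitted += 1
    return history


if __name__ == "__main__":
    results = steer([i / 20 for i in range(21)], budget=10)
    print(max(results, key=lambda pair: pair[1]))
```

In a real campaign the proxy would be a trained ML model, retraining would itself be a task competing for resources, and the executor would dispatch to HPC nodes; the sketch only shows how steering logic interleaves result collection with the choice of new work.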
