HydroNet: Benchmark Tasks for Preserving Intermolecular Interactions and Structural Motifs in Predictive and Generative Models for Molecular Data

Sutanay Choudhury, Jenna A. Bilbrey, Logan Ward, Sotiris S. Xantheas, Ian Foster, Joseph P. Heindel, Ben Blaiszik, Marcus E. Schwarting
DOI: https://doi.org/10.48550/arXiv.2012.00131
2020-12-01
Abstract: Intermolecular and long-range interactions are central to phenomena as diverse as gene regulation, topological states of quantum materials, electrolyte transport in batteries, and the universal solvation properties of water. We present a set of challenge problems for preserving intermolecular interactions and structural motifs in machine-learning approaches to chemical problems, through the use of a recently published dataset of 4.95 million water clusters held together by hydrogen bonding interactions and resulting in longer range structural patterns. The dataset provides spatial coordinates as well as two types of graph representations, to accommodate a variety of machine-learning practices.
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The paper attempts to address the problem of how to maintain intermolecular interactions and structural motifs in predictive and generative models of molecular data. Specifically, the paper focuses on how to simulate and generate hydrogen bond networks and their long-range structural patterns in water clusters using machine learning methods.

### Main Issues

1. **Molecular Property Prediction**: Given the specific spatial coordinate information or bonding structure of a water cluster, predict its energy.
2. **Molecular Generation**: Given a certain number of water molecules, generate candidate structures that conform to low-energy configuration characteristics.

### Scientific Motivation

- Water clusters are discrete networks formed by water molecules connected through hydrogen bonds.
- Although most interactions are short-range (i.e., interactions between neighboring molecules), there are also significant many-body, long-range interactions (i.e., interactions with next-nearest neighbors and beyond).
- Understanding these many-body and long-range hydrogen bond interactions is crucial for addressing long-standing scientific questions about the macroscopic properties of liquid water, ice, and aqueous solutions (such as specific heat capacity, density, dielectric constant, compressibility).
- These interactions play a key role in the bulk and interfacial properties of liquid water and ice, as well as in solvation processes, which are important for applications such as drug delivery, protein folding, quantum material design, and the design of novel battery electrolytes.

### Key Challenges

- A notable feature of water clusters is that many different structures can have very similar energies.
- For a given spatial orientation of oxygen atoms (oxygen network), there are multiple hydrogen bond networks (coarse-grained graphs) based on the arrangement of hydrogen atoms.
- The structural characteristics of low-energy hydrogen bond networks (such as degree distribution, shortest path length, polygon distribution, etc.) systematically change with cluster size.
- Generative methods should consider these characteristics because clusters far from the distribution typically have higher energy and are therefore not of interest.

### Dataset Description

- The dataset contains the lowest energy configurations of 4.95 million water clusters, each consisting of 3 to 30 water molecules.
- The dataset provides spatial coordinates and two types of graph representations: atomic interaction graphs (capturing intra- and intermolecular bonding patterns) and coarse-grained graphs (capturing only the intermolecular structure of the cluster).

### Machine Learning Tasks

1. **Cluster Potential Energy Prediction Task**:
   - Predict the potential energy of water clusters given the known structural geometry (geometry to energy).
   - Predict potential energy solely from the connectivity of water molecules (graph to energy).
2. **Molecular Generation Task with Structural Feature Preservation**:
   - Given a certain number of water molecules, generate geometric/atomic/coarse-grained representations that satisfy specific graph-theoretic structural features and minimize cluster energy by optimizing the relative spatial arrangement of atoms and molecules.

### Summary

By proposing a series of challenging problems and corresponding datasets, the paper aims to advance the application of machine learning to molecular data, particularly in maintaining intermolecular interactions and structural motifs. These tasks not only help to understand the complex structures of water clusters but also provide a foundation for developing new machine learning methods.
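
### Illustrative Sketch: Coarse-Grained Graph and Structural Features

To make the coarse-grained graph and the graph-theoretic features mentioned above (degree distribution, shortest path length) more concrete, here is a minimal Python sketch. It is not the dataset's own tooling. The function names, the distance-only hydrogen bond criterion, the 2.5 Å cutoff, and the idealized square tetramer geometry are all assumptions made for illustration.

```python
import math
from collections import Counter, deque

# Hypothetical cutoff (angstroms) for an intermolecular H...O contact.
# This is an assumption for illustration, not the dataset's criterion.
HBOND_CUTOFF = 2.5


def coarse_grained_graph(waters, cutoff=HBOND_CUTOFF):
    """Build an undirected graph with one node per water molecule.

    `waters` is a list of (O, H1, H2) coordinate tuples. Two molecules are
    connected when a hydrogen of one lies within `cutoff` of the oxygen of
    the other (a simplified, distance-only hydrogen bond test).
    """
    adjacency = {i: set() for i in range(len(waters))}
    for i, (_, h1, h2) in enumerate(waters):
        for j, (oxygen, _, _) in enumerate(waters):
            if i == j:
                continue
            if any(math.dist(h, oxygen) < cutoff for h in (h1, h2)):
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def degree_distribution(adjacency):
    """Count how many molecules have each number of bonded neighbors."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


def mean_shortest_path(adjacency):
    """Average breadth-first-search distance over connected node pairs."""
    total, pairs = 0, 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy geometry (not from the dataset): four oxygens on a square of side
# 2.8 A; each molecule donates one hydrogen toward the next oxygen and
# points its other hydrogen out of the plane.
oxygens = [(0.0, 0.0, 0.0), (2.8, 0.0, 0.0), (2.8, 2.8, 0.0), (0.0, 2.8, 0.0)]
toy_tetramer = []
for k, o in enumerate(oxygens):
    nxt = oxygens[(k + 1) % 4]
    direction = [(b - a) / 2.8 for a, b in zip(o, nxt)]
    donor_h = tuple(a + 0.96 * d for a, d in zip(o, direction))
    free_h = (o[0], o[1], o[2] + 0.96)
    toy_tetramer.append((o, donor_h, free_h))

graph = coarse_grained_graph(toy_tetramer)
print(degree_distribution(graph), mean_shortest_path(graph))
```

In this toy ring every molecule has exactly two hydrogen-bonded neighbors. A realistic analysis would likely add an angular criterion to the hydrogen bond test and compare the resulting feature distributions across cluster sizes, as the Key Challenges section suggests.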