Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs

Daniil A. Boiko, Thiago Reschützegger, Benjamin Sanchez-Lengeling, Samuel M. Blau, Gabe Gomes
2024-08-08
Abstract: Molecular representation is a foundational element in our understanding of the physical world. Its importance ranges from the fundamentals of chemical reactions to the design of new therapies and materials. Previous molecular machine learning models have employed strings, fingerprints, global features, and simple molecular graphs that are inherently information-sparse representations. However, as the complexity of prediction tasks increases, the molecular representation needs to encode higher fidelity information. This work introduces a novel approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects. We show that the explicit addition of stereoelectronic interactions significantly improves the performance of molecular machine learning models. Furthermore, stereoelectronics-infused representations can be learned and deployed with a tailored double graph neural network workflow, enabling its application to any downstream molecular machine learning task. Finally, we show that the learned representations allow for facile stereoelectronic evaluation of previously intractable systems, such as entire proteins, opening new avenues of molecular design.
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The paper addresses the limitations of existing molecular representations in machine learning, especially for complex prediction tasks. The research team proposes a new representation, Stereoelectronics-Infused Molecular Graphs (SIMGs), to improve the performance of molecular machine learning models.

### Research Background and Motivation

Molecular representations such as simple molecular graphs and fingerprints have long supported work ranging from the fundamentals of chemical reactions to the design of new therapies and materials. However, these representations are inherently information-sparse, which limits them on complex molecular property prediction tasks. As the complexity of prediction tasks increases, molecular representations need to encode higher-fidelity information.

### Solution

To address this issue, the researchers introduce SIMGs, a molecular graph representation based on Natural Bond Orbital (NBO) analysis. It enriches the information content of molecular graphs by explicitly adding stereoelectronic effects. According to the paper, this significantly improves the performance of molecular machine learning models, and the representation can be learned and deployed in any downstream molecular machine learning task.

### Main Contributions

1. **Construction of SIMGs**: SIMGs are built from NBO analysis data, including bond orbitals, lone pairs, and their interactions, so that they better capture the 3D quantum characteristics of molecules.
2. **Approximation of SIMGs (SIMG\*)**: NBO calculations are time-consuming and cannot be applied to large molecules such as proteins. To overcome this, the researchers developed SIMG\*, an approximate representation based on Graph Neural Networks (GNNs) that quickly predicts SIMGs from only the 3D structure of the molecule.
3. **Model Performance Evaluation**: The effectiveness and generalization ability of SIMGs and SIMG\* were validated through molecular property prediction tasks on the QM9 dataset and through the discovery of stereoelectronic effects in proteins.
4. **Active Learning Strategy**: Training data were selected with an active learning method that uses the estimated uncertainty of model predictions to guide data selection, which matters most for large datasets with high chemical diversity.

Through this work, the researchers demonstrate the potential of SIMGs and SIMG\* as efficient molecular representations for downstream tasks, particularly for large molecular systems that were previously intractable.
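
To make the SIMG construction in contribution 1 more concrete, here is a minimal Python sketch of a graph that holds atoms, localized orbitals (bond orbitals and lone pairs), and orbital-orbital interactions. Everything in it is an assumption made for illustration: the class names `Orbital`, `Interaction`, and `SimgSketch`, the function `build_simg_sketch`, the `min_energy` cutoff, and all numeric values are hypothetical placeholders, not the paper's schema, features, or NBO output.

```python
from dataclasses import dataclass, field


@dataclass
class Orbital:
    """Hypothetical record for one localized orbital (not the paper's schema)."""
    kind: str         # "bond" or "lone_pair"
    atoms: tuple      # indices of the atoms the orbital is located on
    occupancy: float  # placeholder electron occupancy


@dataclass
class Interaction:
    """Hypothetical donor-acceptor interaction between two orbitals."""
    donor: int     # index into the orbital list
    acceptor: int  # index into the orbital list
    energy: float  # placeholder interaction energy


@dataclass
class SimgSketch:
    """Atom nodes, orbital nodes, and the two edge types linking them."""
    atoms: list
    orbitals: list
    atom_orbital_edges: list = field(default_factory=list)
    interaction_edges: list = field(default_factory=list)


def build_simg_sketch(atoms, orbitals, interactions, min_energy=0.0):
    """Assemble the graph; min_energy is an assumed cutoff for weak interactions."""
    graph = SimgSketch(atoms=list(atoms), orbitals=list(orbitals))

    # Link every orbital node to the atom(s) it belongs to.
    for orb_idx, orb in enumerate(graph.orbitals):
        expected = 2 if orb.kind == "bond" else 1
        if len(orb.atoms) != expected:
            raise ValueError(f"{orb.kind} orbital needs {expected} atom(s)")
        for atom_idx in orb.atoms:
            if not 0 <= atom_idx < len(graph.atoms):
                raise IndexError(f"atom index {atom_idx} out of range")
            graph.atom_orbital_edges.append((atom_idx, orb_idx))

    # Add orbital-orbital edges for interactions at or above the cutoff.
    for inter in interactions:
        for idx in (inter.donor, inter.acceptor):
            if not 0 <= idx < len(graph.orbitals):
                raise IndexError(f"orbital index {idx} out of range")
        if inter.energy >= min_energy:
            graph.interaction_edges.append(
                (inter.donor, inter.acceptor, inter.energy)
            )
    return graph


# Toy H-C-O-H fragment with made-up numbers, for illustration only.
toy_atoms = ["H", "C", "O", "H"]
toy_orbitals = [
    Orbital("bond", (0, 1), 2.0),
    Orbital("bond", (1, 2), 2.0),
    Orbital("bond", (2, 3), 2.0),
    Orbital("lone_pair", (2,), 2.0),
    Orbital("lone_pair", (2,), 2.0),
]
toy_interactions = [Interaction(3, 0, 6.0), Interaction(4, 0, 0.2)]
toy_graph = build_simg_sketch(
    toy_atoms, toy_orbitals, toy_interactions, min_energy=0.5
)
print(len(toy_graph.atom_orbital_edges), len(toy_graph.interaction_edges))
```

Keeping atom-orbital membership and orbital-orbital interactions as separate edge lists is one possible design choice. It lets a downstream model treat the two relation types differently, but the summary above does not say how the authors encode them.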